diff --git a/AGENT.md b/AGENT.md new file mode 100644 index 00000000..bf96ab2e --- /dev/null +++ b/AGENT.md @@ -0,0 +1,728 @@ +# AGENT.md — build a government-news bot with govbot + +You are a Claude Code session helping an activist stand up, operate, or +evolve a **govbot newsbot** — a project that pulls real legislative data, +filters it down to the issue the activist cares about, and publishes the +matches (today, to a Bluesky account) at **nearly-free** running cost. + +govbot is a **4-tool stack** and the playbook below follows that shape: + +1. **Select real gov data** — `govbot pull` clones the legislation of all + 50 states, DC, the territories, and federal Congress from a content- + addressed registry of git repos. Scrapers thanks to OpenStates. +2. **Filter / transform** — fastclass tags each bill against an issue + taxonomy the activist owns; the publishers filter on those tags. The + planned `summarize` transform (local-LLM digests of grouped bills with + a trace of model + source data) is not yet built — userland keeps a + `summarizer/prompt.md` stub for when it lands. +3. **Publish with receipts** — RSS, HTML, JSON, DuckDB, and a Bluesky + posting bot today; X and a "receipts" GitHub Pages artifact (the + deterministic provenance behind every AI digest: model id, source + bills, fastclass reasoning, regen recipe) are roadmap. +4. **Coding-agent-native dev experience** — *this file is tool #4*. A + fresh Claude Code session reads it and can make / manage / update a + project end-to-end with no other onboarding. + +The cost bar is climate-activist's: **nearly free to run, worth reading**. +If a choice in the playbook would push the activist toward a paid API +when a local model would do, push back; if a choice would make a post less +trustworthy, prefer the choice that ships the receipt. + +This file is the **end-user playbook**. 
A fresh session loads it by URL: + +> Read github.com/chihacknight/govbot/AGENT.md and follow it to set up a +> govbot project here. + +There is no plugin, no marketplace, no slash command to install for govbot +itself — this document *is* the bootstrap. (You will, near the end, add the +**fastclass** plugin to the new project so its classifier can be tuned.) + +> This is NOT `CLAUDE.md`. `CLAUDE.md` in the govbot repo is a contributor +> guide for engineers working *on* govbot. `AGENT.md` (this file) is for +> *end users* building a bot *with* govbot. Do not conflate them. + +govbot is **issue-agnostic**. Climate legislation is the first use case, not +the only one — transportation, housing, AI/data-center policy, education, and +any other topic work the same way. Interview the user for their issue; never +assume climate. + +--- + +## The three jobs + +A user comes to you for one of three things. Identify which, then jump to +that section. Each job exercises the 4-tool stack from a different angle: +**make** scaffolds the pull+filter+publish chain (today's MVP — does NOT +yet scaffold a summarize transform or a receipts page, neither of which +exists); **manage** keeps the loop running and introduces +`/fastclass:improve autonomous` after first ratification (the +activist-default for hands-off improvement); **update** evolves the stack. + +| Job | The user says… | Section | +|---|---|---| +| **make** | "set up a govbot project / newsbot here" | [§1](#1-make--scaffold-a-new-newsbot) | +| **manage** | "set up / run the Bluesky bot", "schedule it" | [§2](#2-manage--operate-the-bluesky-bot) | +| **update** | "add a dataset", "the classifier misses bills" | [§3](#3-update--evolve-an-existing-project) | + +Govbot's per-topic-bot pattern is owed to Frankie Vegliante's +[CHN-Bluesky-Govbot-Main](https://github.com/frankies2727/CHN-Bluesky-Govbot-Main) +framework — the framework that first ran 13 civic-issue Bluesky bots on the +same legislative-data pipeline. 
+ +--- + +## The model — read this before doing anything + +govbot is a **CLI** plus two companion concepts. Keep them straight: + +- **`govbot`** — the gov-data tool. Pulls datasets (git repos of legislation), + runs transforms over them, and runs publishers. Its config is `govbot.yml`, + a **manifest** (`datasets` / `transforms` / `publish` / `pipelines`). +- **`fastclass`** — a separate text-classifier CLI. govbot streams bills into + it; it scores each against a **classifier bundle** (a directory: + `classifier.yml` + `fusion.yml` + `eval/`). govbot only passes the bundle's + *path* — it never reads the taxonomy itself. +- **The userland project** (what you scaffold) — a directory holding + `govbot.yml`, the `classifier/` bundle, and a few support files. It owns + **no code**; everything is reconstructed by running the tools. + +The real CLI verbs — use these exact names, they are current: + +``` +govbot init # scaffold a govbot.yml (the setup wizard) +govbot search # search the dataset registry +govbot add # add datasets to govbot.yml's datasets: list +govbot remove # remove datasets from govbot.yml +govbot ls # list manifest + locally-cached datasets +govbot pull # clone/update datasets (git repos) into the cache +govbot source # stream legislative activity as JSON Lines +govbot apply # persist fastclass results under /tags/ +govbot publish # run the manifest's publishers +govbot run # the full pipeline: pull -> source|classify|apply -> publish +fastclass classify - # score a JSON-Lines doc stream from stdin +fastclass describe classifier= # print a bundle's tags + interface +fastclass compile evaluate / backtest / ratify # the tuning primitives +``` + +Datasets are resolved at runtime through a **dataset registry** — an +index mapping a dataset id to its git repo. A bare jurisdiction code (`wy`) +and a namespaced id (`us-legislation/wy`) both resolve. 
`govbot search` +queries the registry; `govbot add` validates an id against it before writing +it into `govbot.yml`. `govbot pull` clones each dataset once into a shared +machine-wide cache (`~/.govbot/cache/`) and records the exact commit in +`govbot.lock` for reproducible runs. + +The classify step is a Unix pipe across the two tools: + +``` +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +``` + +`govbot run` wires that pipe (plus pull and publish) automatically from +`govbot.yml`. + +--- + +## 1. make — scaffold a new newsbot + +### 1.1 Verify the tools are installed + +Both `govbot` and `fastclass` must be resolvable. govbot resolves binaries in +this order: **`$PATH` → `~/.cargo/bin` → `~/.govbot/bin`**. Check: + +```bash +command -v govbot || ls ~/.cargo/bin/govbot ~/.govbot/bin/govbot 2>/dev/null +command -v fastclass || ls ~/.cargo/bin/fastclass 2>/dev/null +``` + +If `govbot` is missing, install the nightly: + +```bash +sh -c "$(curl -fsSL https://raw.githubusercontent.com/chihacknight/govbot/main/actions/govbot/scripts/install-nightly.sh)" +``` + +If `fastclass` is missing, build it from source. `fastclass` is a separate +repo; its public home is still being decided (architecture open question), so +ask the user where their fastclass checkout lives and adapt: + +```bash +# In the user's fastclass checkout: +just install # -> ~/.cargo/bin/fastclass +# or: +cargo install --path . # same effect, without `just` +``` + +If the user has no checkout yet, ask them for a path; if they have neither, the +classify stage cannot run — say so and stop here rather than scaffold a broken +project. + +Ensure `~/.cargo/bin` and `~/.govbot/bin` are on `PATH`: + +```bash +export PATH="$HOME/.cargo/bin:$HOME/.govbot/bin:$PATH" +``` + +Do not proceed until both `govbot --help` and `fastclass --help` run. + +### 1.2 Interview the user + +Ask, and record the answers — they drive every file you generate: + +1. 
**Issue area.** What topic should the bot track? (climate, transit, + housing, AI/data centers, education, …) Get 2–5 specific sub-themes — these + become the classifier **tags**. +2. **Jurisdictions / datasets.** All jurisdictions (`all`), or a subset? + Don't guess the codes — query the registry: `govbot search` lists every + dataset, `govbot search wyoming` narrows it. Dataset ids are short + (`wy`, `il`, `ca`, `ny`, …). When unsure, start with 1–3 for a fast first + run. +3. **What to publish.** A Bluesky feed? An RSS feed / HTML index? Both? For + Bluesky, what handle will the bot post from? + +### 1.3 Generate the project + +Create these files in the **current directory**. Adapt every name and tag to +the user's issue — the examples below use a transit bot; do not copy them +verbatim for a climate user. + +#### `govbot.yml` — the manifest (NO `tags:` block) + +```yaml +# govbot.yml — project manifest. Declares datasets, transforms, publishers, +# and pipelines. It is NOT the classifier: the tag taxonomy lives in +# classifier/classifier.yml, referenced here only by path. +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - il + - ny + # - all # uncomment to track every jurisdiction + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + classifier: ./classifier + +publish: + bluesky: + type: bluesky + select: [transit_funding, transit_safety] # tag names from classifier.yml + min_score: 0.6 # calibrated final_score threshold; 0..1 + # `{link}` defaults to the companion `html` publisher's base_url (the + # human landing page), so the cleanest setup is to declare an `html` + # publisher in this manifest. `base_url` here is only the fallback when + # no `html` publisher is configured. 
+ base_url: "https://.github.io/" + post_template: "{title}\n\n{tags} · {link}" + # ledger: state/bluesky-bluesky.ledger # default; tracks posted bills + + feed: + type: rss # writes /feed.xml (only) + select: [transit_funding, transit_safety] + base_url: "https://.github.io/" + output_dir: docs + + site: + type: html # writes /index.html (only) + select: [transit_funding, transit_safety] + base_url: "https://.github.io/" + output_dir: docs + +pipelines: + default: [classify, bluesky, feed, site] +``` + +Notes: +- **No `tags:` key.** It is retired; a manifest carrying it fails to parse. +- **One publisher type, one artifact.** `type: rss` writes only the RSS + feed; `type: html` writes only the HTML index. Declare both to get both + (an earlier release wrote both files from each — a silent + last-writer-wins collision on `index.html`). +- `publish..select` lists tag names — they must exist in the classifier + bundle. Validate later with `fastclass describe`. +- Drop the `feed` / `site` publishers if the user only wants Bluesky, and + vice versa. +- Prefer `govbot add ` over hand-editing the `datasets:` list — it + validates each id against the registry first. Use `govbot init` to scaffold + the whole `govbot.yml` interactively. + +#### `classifier/` — the fastclass bundle + +``` +classifier/ + classifier.yml the taxonomy (tags) — REQUIRED + fusion.yml matcher fusion weights + cascade band + eval/ + constitution.yml frozen gold set — NEVER shown to an LLM + rolling.yml refreshable working eval set + proposals/ improvement-proposal history (starts empty) +``` + +`classifier/classifier.yml` — seed one tag per sub-theme from the interview: + +```yaml +# classifier.yml — the taxonomy. Owned by fastclass; govbot only references +# the directory by path. Tune it with the fastclass /fastclass:improve loop. +tags: + transit_funding: + description: >- + Bills funding public transit — operating subsidies, capital programs, + fare policy, and dedicated transit revenue. 
+ include_keywords: + - public transit + - bus rapid transit + - rail funding + - transit operating + - farebox + exclude_keywords: + - highway fund + threshold: 0.3 + transit_safety: + description: >- + Bills addressing transit rider and worker safety — assaults on operators, + platform safety, grade-crossing safety. + include_keywords: + - transit safety + - operator assault + - grade crossing + - platform screen + threshold: 0.3 +``` + +`classifier/fusion.yml` — start minimal; fastclass applies defaults if absent: + +```yaml +# fusion.yml — fusion weights + the cascade uncertainty band. +version: fusion-v1 +``` + +`classifier/eval/constitution.yml` — the **frozen** gold set. Seed 2–3 bills +per tag from the user's knowledge. This set is the final judge of classifier +quality and is never shown to an LLM: + +```yaml +# constitution.yml — FROZEN gold standard. Curate by hand; never edit it to +# make a number go up. Never show it to an LLM. +items: + - id: tf-capital + text: >- + AN ACT appropriating funds for a regional rail capital program and + dedicated transit operating subsidies. + expected_tags: [transit_funding] + - id: ts-operator + text: >- + A BILL increasing penalties for assault on a transit bus operator and + funding platform safety improvements. + expected_tags: [transit_safety] +``` + +`classifier/eval/rolling.yml` — the refreshable working set the improvement +loop learns from. Start with the same shape; grow it as you find misses: + +```yaml +# rolling.yml — refreshable working eval set. Add bills the classifier gets +# wrong here; closing them is what /fastclass:improve does. +items: + - id: roll-tf-fare + text: >- + A BILL establishing a reduced-fare transit program for low-income riders. + expected_tags: [transit_funding] +``` + +Leave `classifier/proposals/` an empty directory (add a `.gitkeep`). 
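The bundle files above have to agree with each other: every `expected_tags` entry in the eval sets should name a tag that exists in `classifier.yml`, and every tag should ideally be exercised by at least one eval item. The sketch below is a minimal illustration of that consistency check, not part of govbot or fastclass. It assumes the transit example's data has been copied into plain Python structures; the helper names `unknown_tags` and `uncovered_tags` are hypothetical, and a real check would load the YAML files with a parser.

```python
# Hypothetical in-memory mirror of the transit example above. A real check
# would load classifier.yml and the eval files with a YAML parser instead.
taxonomy = {"transit_funding", "transit_safety"}

eval_items = [
    {"id": "tf-capital", "expected_tags": ["transit_funding"]},
    {"id": "ts-operator", "expected_tags": ["transit_safety"]},
    {"id": "roll-tf-fare", "expected_tags": ["transit_funding"]},
]


def unknown_tags(items, known):
    """Return {item id: [tags missing from the taxonomy]} for offending items only."""
    problems = {}
    for item in items:
        missing = [tag for tag in item["expected_tags"] if tag not in known]
        if missing:
            problems[item["id"]] = missing
    return problems


def uncovered_tags(items, known):
    """Return taxonomy tags that no eval item exercises, sorted for stable output."""
    seen = {tag for item in items for tag in item["expected_tags"]}
    return sorted(known - seen)


print(unknown_tags(eval_items, taxonomy))
print(uncovered_tags(eval_items, taxonomy))
```

An empty result from both helpers means the seeded eval sets and the taxonomy are consistent. A misspelled tag in an eval item would show up in the first result, and a tag with no gold examples would show up in the second.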
+ +#### `summarizer/prompt.md` — framing prompt for a future summarize stage + +```markdown +# Summarizer prompt + +A future govbot `summarize` transform will use this prompt to turn a matched +bill into publish-ready framing for the audience. + +Frame each bill in 1–2 sentences for a reader: what the bill +does, why it matters to the issue, and what stage it is at. Neutral, factual, +no hyperbole. +``` + +#### `.env.example` — credential template + +```bash +# Copy to .env and fill in. .env is git-ignored — never commit real values. +# Bluesky credentials for the `bluesky` publisher. +# Create an APP PASSWORD at: Bluesky -> Settings -> App Passwords. +# NEVER use your main account password. +BLUESKY_HANDLE=yourbot.bsky.social +BLUESKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx +# Optional — defaults to https://bsky.social +# BLUESKY_SERVICE=https://bsky.social +``` + +#### `.gitignore` + +```gitignore +# Generated by the tools — reconstructed on every run. +.govbot/ +dist/ +docs/ +# Classification output from `govbot apply` — regenerated each run. +# Remove this line to commit classification provenance. +tags/ +# Publisher state from `govbot publish` (posted-state ledgers). +# Remove this line to commit post history. +state/ +# Secrets — never commit. +.env +``` + +#### `README.md` + +A short project README: what the bot tracks, the datasets, how to run it +(`govbot run`), and a pointer to this AGENT.md. + +#### `CLAUDE.md` — make every later session govbot-aware + +Write this into the **new project** so any Claude Code session opened here +loads the playbook without the user re-pasting the prompt: + +```markdown +# CLAUDE.md + +This is a **govbot newsbot** project. Before doing govbot work in this repo, +read the govbot end-user playbook and follow it: + + Read github.com/chihacknight/govbot/AGENT.md and follow it. 
+ +Project layout: +- `govbot.yml` — the manifest (datasets / transforms / publish / pipelines) +- `classifier/` — the fastclass classifier bundle (the tag taxonomy) +- `summarizer/` — framing prompt for a future summarize stage +- `.env` — Bluesky credentials (git-ignored; see `.env.example`) + +Tool-managed dirs (all git-ignored by default): +- `.govbot/` — the tool's CACHE (cloned datasets, sync state); the + `node_modules/` equivalent. Never edit by hand; + `rm -rf .govbot/` is always safe. +- `tags/` — classification OUTPUT from `govbot apply` + (`tags//country:.../sessions//.tag.json`). + Remove `tags/` from `.gitignore` if you want + classification provenance committed. +- `state/` — publisher STATE from `govbot publish` (e.g. the + bluesky publisher's posted-state ledger, + `state/bluesky-.ledger`). Regenerable-but- + operational: deleting it makes the next run + double-post. Remove `state/` from `.gitignore` to + commit post history and let cold clones resume. +- `dist/` / `docs/` — publisher output from `govbot publish`. + +To tune the classifier, use the fastclass plugin: `/fastclass:improve`. +``` + +#### `.claude/settings.json` — import the fastclass plugin + +So the user can run `/fastclass:improve` to tune the classifier: + +```json +{ + "plugins": { + "fastclass": { + "source": "/plugins/fastclass" + } + } +} +``` + +Confirm the exact plugin-source syntax against the fastclass repo's README +(`plugins/fastclass/`); adjust if the user's fastclass checkout lives +elsewhere. + +#### Install the semantic Tier-2 model + +A scaffolded classifier bundle has the taxonomy and fusion config — but no +embedding model. Without one, the cascade in `fusion.yml`'s +`uncertainty_band` silently degrades to lexical-only matchers, which means +the bot will **miss paraphrases and euphemisms** (real-data audits typically +show this as a 10–15 point recall gap on issue-flavored language: "energy +diversity" never matches `clean_energy`, etc.). 
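The recall gap described above can be illustrated with a toy keyword matcher. This is a sketch only and makes no claim about how fastclass implements its lexical tier. The function `lexical_match` and the keyword list for a `clean_energy` tag are hypothetical, chosen to mirror the "energy diversity" example.

```python
def lexical_match(text, include_keywords):
    """Case-insensitive substring match, the only signal a keyword matcher has."""
    lowered = text.lower()
    return [kw for kw in include_keywords if kw.lower() in lowered]


# Hypothetical keywords for a `clean_energy` tag (not taken from any real bundle).
clean_energy_keywords = ["clean energy", "renewable", "solar", "wind power"]

literal = "A BILL establishing a clean energy grant program."
euphemism = "A BILL promoting energy diversity in the state portfolio."

# The literal phrasing hits a keyword; the euphemism shares no keyword at all,
# so a lexical-only matcher has nothing to score it on.
print(lexical_match(literal, clean_energy_keywords))
print(lexical_match(euphemism, clean_energy_keywords))
```

The second bill is about the same issue but produces no keyword hits, which is the kind of miss an embedding-based tier is meant to catch.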
+ +Fix this once, at scaffold time, by running the install-model plugin +command: + +``` +/fastclass:install-model +``` + +The command shows the vetted-model list, defaults to the recommended small +encoder (sentence-transformers/all-MiniLM-L6-v2, ~22 MB), downloads it into +the project-shared cache at `~/.govbot/models//`, and links it +into `classifier/model/` so `govbot run` picks up Tier-2 automatically on +the next pipeline pass. Verify with: + +```bash +fastclass describe classifier=./classifier +# JSON output should include a `model: {…}` block. +``` + +If the download fails (offline laptop, HuggingFace rate-limit), the CLI +prints a `curl` recipe the user can run themselves and re-invoke the +plugin command — the install path is idempotent. + +### 1.4 First run + +```bash +govbot pull il ny # clone the datasets (or: govbot pull all) +govbot run --dry-run # pull -> classify -> apply -> publish (render-only) +govbot run # same, but actually emits / posts +``` + +`govbot pull` clones each dataset once into the shared `~/.govbot/cache/` and +writes `govbot.lock` pinning the exact commit each resolved to. Commit +`govbot.lock` to the project repo — it makes runs reproducible. A second +`pull` (here or in any other project) reuses the cache instead of re-cloning. + +`govbot run --dry-run` propagates `--dry-run` to every publisher — the +`bluesky` publisher honours this by rendering the posts it *would* send and +touching no network and no ledger. Pair the dry-run with §2.3 before going +live with the Bluesky bot. + +When the Bluesky creds (`BLUESKY_HANDLE` / `BLUESKY_APP_PASSWORD`) are not +set, the `bluesky` publisher logs a `WARN` and **skips** rather than failing +the pipeline — so a first-time `govbot run` without creds still emits the +RSS / HTML feeds. + +--- + +## 2. manage — operate the Bluesky bot + +The `bluesky` publisher is a **posting bot**: it posts to a normal Bluesky +account via the AT Protocol and runs to completion (no server). 
It is +idempotent — a posted-state ledger keeps re-runs from double-posting. + +**Activist default after first ratification: autonomous mode.** Once the +activist has ratified one classifier proposal end-to-end (so they have +felt the loop once and seen what the constitution gate does), the +beginner-default ongoing posture is the **autonomous** form of the +improvement loop, invoked as: + +``` +/fastclass:improve autonomous +``` + +Under the hood this runs `fastclass compile ratify .yml +--autonomous`. The constitution stays sovereign: proposals that pass the +frozen constitution gate apply as usual, and proposals where the +constitution is silent (a **coverage gap** — the gate cannot prove the +change good or bad) re-test against the rolling eval set and land only +if rolling proves them safe (flips at least one rolling failure to +passing, regresses no rolling case, no per-tag precision loss). A +bad-fix reject (counts moved on the constitution but F1 did not improve) +or any rolling regression always results in a refusal — the rolling gate is strictly +weaker than the constitution and cannot overrule a precision-regression +reject. The `fastclass.lock` file marks autonomously-applied proposals +with `generated_by: autonomous-coverage-gap`, so the audit trail is +preserved — the receipt story extends into the classifier. + +This is the mode that lets the activist crew run the bot **hands-off +between ratifications** without giving up provenance, which is the whole +reason the cost story is "nearly free to operate". Use it as the +ongoing-improvement default; drop back to the reviewed +`/fastclass:improve` path (§3) when you want to see and ratify a +specific proposal — e.g. when widening scope with a new tag, or after a +flurry of autonomously applied coverage-gap proposals that you want to read through. + +### 2.1 Create the app password + +1. In the Bluesky app: **Settings → App Passwords → Add App Password**. +2. Copy the generated password (format `xxxx-xxxx-xxxx-xxxx`). +3. 
Put credentials in the environment — **never in `govbot.yml`**: + +```bash +cp .env.example .env +# edit .env: +# BLUESKY_HANDLE=yourbot.bsky.social +# BLUESKY_APP_PASSWORD=xxxx-xxxx-xxxx-xxxx +``` + +Load it before running: `set -a; source .env; set +a`. + +### 2.2 The publisher config + +Under `govbot.yml: publish:` (see the template in §1.3): + +| Field | Meaning | +|---|---| +| `type: bluesky` | selects the Bluesky publisher | +| `select` | tag names to post — must exist in the classifier bundle | +| `min_score` | minimum calibrated `final_score` (0..1) to post; default `0.6` | +| `base_url` | fallback prefix for `{link}` when no companion `html` publisher is configured; same shape as the rss/html publishers' `base_url` | +| `post_template` | post text; placeholders `{title} {tags} {link} {identifier} {session} {score}`; truncated to 300 chars | +| `ledger` | posted-state ledger path; default `state/bluesky-.ledger` (peer to `tags/` and `dist/`; NOT under `.govbot/`, which is the tool's cache). A ledger at the legacy `.govbot/bluesky-.ledger` path is read as a fallback so upgrades don't lose history. | + +`{link}` resolves in this order: (1) the manifest's `html` publisher's +`base_url` — the **human-readable landing page** activists actually click +through to; (2) the bluesky publisher's own `base_url` joined to the bill's +dataset path; (3) the bill's first upstream source URL. Configuring an +`html` publisher alongside `bluesky` makes the default useful — without it, +`{link}` resolves to a raw `metadata.json` path under `base_url`. + +Credentials are **never** config fields — they are env-only. + +### 2.3 Dry-run first — always + +```bash +govbot publish --publisher bluesky --dry-run +# or, end-to-end through the whole pipeline: +govbot run --dry-run +``` + +`--dry-run` renders the posts that *would* be sent and **touches no network +and no ledger**. 
Review the rendered text with the user — check the template, +the 300-char truncation, and that `min_score` is neither too loose (spam) nor +too tight (silence). Adjust `post_template` / `min_score` and re-dry-run. + +`govbot run --dry-run` is the recommended first invocation: it propagates +`--dry-run` to every publisher and exits clean even without Bluesky creds. + +### 2.4 Go live + +```bash +set -a; source .env; set +a +govbot publish --publisher bluesky +``` + +The publisher authenticates (`com.atproto.server.createSession`), posts each +matching bill not already in the ledger (`com.atproto.repo.createRecord`), and +appends each posted bill's id to the ledger. Re-running posts only new +matches. + +### 2.5 Schedule it + +The bot runs from cron/CI — no always-on server. + +**cron** (every 6 hours): + +```cron +0 */6 * * * cd /path/to/project && set -a && . ./.env && set +a && govbot run >> .govbot/run.log 2>&1 +``` + +**GitHub Actions** (`.github/workflows/newsbot.yml`): + +```yaml +name: newsbot +on: + schedule: [{ cron: "0 */6 * * *" }] + workflow_dispatch: +jobs: + run: + runs-on: ubuntu-latest + env: + BLUESKY_HANDLE: ${{ secrets.BLUESKY_HANDLE }} + BLUESKY_APP_PASSWORD: ${{ secrets.BLUESKY_APP_PASSWORD }} + steps: + - uses: actions/checkout@v4 + - name: Install govbot + fastclass + run: | + sh -c "$(curl -fsSL https://raw.githubusercontent.com/chihacknight/govbot/main/actions/govbot/scripts/install-nightly.sh)" + # install fastclass per its repo's instructions + echo "$HOME/.govbot/bin:$HOME/.cargo/bin" >> "$GITHUB_PATH" + - name: Run the newsbot + run: govbot run + # Commit the ledger back so re-runs stay idempotent across CI runs: + - name: Persist the posted-state ledger + run: | + git add -f state/*.ledger || true + git commit -m "newsbot: update posted-state ledger" || true + git push || true +``` + +In CI the `state/` ledger is ephemeral unless persisted — commit the +`*.ledger` file back (as above; you'll also want to remove the `state/` +line from 
`.gitignore` so the commit isn't a force-add forever) or store +it in a cache/artifact, or the bot will re-post on every run. + +--- + +## 3. update — evolve an existing project + +Open the project, read its `govbot.yml` and `classifier/classifier.yml`, then: + +### Add or remove a dataset + +Use the registry-backed commands rather than hand-editing `govbot.yml`: + +```bash +govbot search # find the dataset id in the registry +govbot add # validate it and add it to govbot.yml datasets: +govbot pull # clone it (updates govbot.lock) +govbot run +``` + +To drop a dataset: `govbot remove `. `govbot ls` shows the manifest's +datasets and which are cached locally. + +### Add or remove a publisher / change what gets posted + +Edit `govbot.yml: publish:` — add a publisher block, or change a `select` +list or `min_score`. Validate that every `select` tag exists in the bundle: + +```bash +fastclass describe classifier=./classifier # prints the bundle's tag list +``` + +Then dry-run any Bluesky publisher before going live (§2.3). + +### Widen or narrow the classifier scope + +The taxonomy lives in `classifier/classifier.yml`. **Do not hand-tune it by +guessing keywords** — delegate to the fastclass improvement loop, which proves +each change against the frozen gold set. + +1. **Measure** where the classifier stands: + ```bash + fastclass compile evaluate --eval constitution classifier=./classifier + fastclass compile evaluate --eval rolling classifier=./classifier + ``` +2. **Find misses.** Add bills the classifier gets wrong to + `classifier/eval/rolling.yml` with their correct `expected_tags`. To widen + scope, add a new tag to `classifier.yml` plus gold examples for it in both + eval sets. +3. 
**Improve.** Run the fastclass plugin — it studies the rolling failures, + drafts a proposal under `classifier/proposals/`, and is the supported way + to tune the bundle: + ``` + /fastclass:improve # reviewed: you ratify each proposal + /fastclass:improve autonomous # hands-off: constitution-passing applies, + # coverage-gap re-tests against rolling + ``` + Use the reviewed form for the first pass (so you see what the gate does) + and when widening scope with a new tag; switch to `autonomous` as the + ongoing default once you've felt the loop — see §2's autonomous-mode + callout for the gate semantics. +4. **Backtest** the proposal — proves it against the frozen constitution: + ```bash + fastclass compile backtest classifier/proposals/prop-0001.yml classifier=./classifier + ``` +5. **Promote** a passing proposal into the bundle: + ```bash + fastclass compile ratify classifier/proposals/prop-0001.yml classifier=./classifier + ``` +6. **Re-run** the bot: `govbot run`. + +Hard rule, inherited from fastclass: **never show `classifier/eval/ +constitution.yml` to an LLM.** It is the frozen judge; seeing it would corrupt +the eval. The improvement loop only ever reads `rolling.yml`. + +--- + +## Conventions + +- Ground every command in the real CLI above. If a verb is not in the + reference list, it does not exist — check `govbot --help` / `fastclass --help`. +- `govbot.yml` never has `tags:`. The taxonomy is the classifier bundle. +- Credentials are environment-only. Never write a secret into `govbot.yml`, + `.env.example`, or any committed file. +- Bluesky: dry-run before every first live run after a config change. +- Three tool-managed dirs, each with a distinct role: `.govbot/` is the + CACHE (the `node_modules/` equivalent — never edited, fully regenerable), + `tags/` is `govbot apply`'s classification OUTPUT + (`tags//country:.../sessions//.tag.json`), and + `dist/` / `docs/` are publisher output. 
All four are git-ignored by + default; the project is a dozen small text files plus tool artifacts. diff --git a/CLAUDE.md b/CLAUDE.md index 93f80a61..368ab498 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -4,6 +4,48 @@ This file provides senior engineering-level guidance for Claude Code when workin ## Project Overview +**govbot is a 4-tool stack for civic-data publishing**, built so an +activist crew can run a credible news-bot at nearly-free cost on commodity +infrastructure (GitHub Actions + a laptop with local models). The stack +exists to clear one bar: the first user, the **climate-activist** userland +repo, must be able to ship Bluesky posts that are "worth reading" at +"nearly free to run/improve". Every architectural choice in this repo +should be checked against that. + +The 4 tools, with the honest state of each: + +1. **Select real gov data** — `govbot pull` over 55 OCD dataset git repos + (every US state + DC + territories + federal Congress), content- + addressed in `~/.govbot/cache/`. `govbot doctor` validates. Today + `govbot source --select docs` ships bill text + subjects; **sponsors + and voting records are captured in metadata but not yet projected + into `--select docs`** — a recall gap for sponsor-pattern signals. +2. **Filter / transform** — fastclass tagging is the shipped transform + (Wave A). The planned **`summarize` transform** (local-LLM digests + of grouped bills, emitted with model id + source bill ids + prompt + revision so the digest is reproducible) **does not exist** — + userland holds a `summarizer/prompt.md` stub. +3. **Publish with receipts** — RSS, HTML, JSON, DuckDB, and a Bluesky + posting bot ship today. **X is not built. AI digest publishing is + not built.** **"Receipts" as defined in the vision** — a GitHub + Pages artifact carrying the deterministic provenance behind every + AI digest (model used, source bill ids, fastclass scores + + reasoning, regen command) — **is a new capability that does not yet + exist**. 
The current classification evidence chains carry most of + the data a receipt would need; they are not yet packaged into a + public artifact. +4. **Coding-agent-native dev experience** — `AGENT.md` provides the + make/manage/update flow that a fresh Claude Code session can follow + without other onboarding. The fastclass plugin + (`/fastclass:from-intent`, `/fastclass:improve`, `/fastclass:ratify`, + `/fastclass:install-model`) handles the classifier loop. `govbot + doctor` validates installations. This is the one tool that is + already shipping its vision. + +Operators: keep the gap map above honest as features land. The README's +Roadmap section is the public version of this list; this CLAUDE.md is the +internal version, biased toward what the code actually does today. + This is **govbot** - a monorepo for distributed data analysis of government updates. Git repos function as datasets, including legislation from 47+ states/jurisdictions. The `actions/` folder contains self-contained modules that can run as shell scripts or GitHub Actions. ## Senior Engineering Prompts @@ -42,7 +84,7 @@ Use these meta-prompts to guide architectural decisions and code quality. ### Performance & Scale -- **"What happens with 10x the data?"** - Current scale is ~47 jurisdictions. Consider: What if we add counties? Cities? Federal agencies? +- **"What happens with 10x the data?"** - Current scale is ~55 dataset repos (all US state/territory legislatures + federal). The runtime registry (`registry.json`) is what makes 10x feasible — adding counties, cities, or agencies is a data change, not a recompile. - **"Can this be parallelized?"** - State-level operations are inherently parallel. Pipelines should support concurrent execution. 
@@ -80,14 +122,54 @@ scripts/ # Repository-level utility scripts ## Common Commands ```bash -govbot init # Create govbot.yml config -govbot clone all # Download all state legislation datasets -govbot clone wy il # Download specific states -govbot logs # Stream legislative activity as JSON Lines -govbot logs | govbot tag # Process and tag data +govbot # Scaffold govbot.yml (interactive wizard), then run the pipeline +govbot pull all # Download all state legislation datasets +govbot pull wy il # Download specific states +govbot source # Stream legislative activity as JSON Lines +govbot logs # Deprecated alias for `govbot source` (default mode); back-compat with the CHN-Bluesky-Govbot-Main framework's `govbot logs > bills.jsonl` +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply govbot load # Load bill metadata into DuckDB -govbot build # Generate RSS feeds +govbot publish # Run the manifest's publishers (RSS / HTML / JSON / DuckDB / Bluesky) +govbot run # Run the full pipeline: pull -> classify -> apply -> publish +``` + +## govbot source — streaming legislative activity + +`govbot source` walks every linked dataset and emits one JSON record per +bill log entry. It is the **source** stage of the stream protocol — the +records `govbot publish` and `fastclass classify` consume. + +### The `--filter default` policy + +`--filter` defaults to `default`, which applies the per-dataset filter under +`actions/govbot/src/filters//default.rs`. Each dataset's `default.rs` +implements an **action-based** rule that drops *routine* log entries — +introductions, committee referrals, "Bill Number Assigned", "Placed on +General File", boilerplate "President Signed" lines, prefiling, status +updates — so the stream emits only **substantive** events (passage votes, +executive signatures, amendments, defeats, committee reports with content). + +This is not a recency cut. A bill whose only log entries are routine +actions — e.g. 
a freshly-filed bill with just an "Introduction" log — +emits **zero records** under `--filter default` until a substantive event +lands. The bill itself is not deleted; it simply produces no stream rows +yet. Once a substantive log appears (e.g. a passage vote later in the +session), the bill flows through. + +If a bill is unexpectedly missing from `source` output: +```bash +govbot source --filter none --repos # confirm it's the filter ``` +If `--filter none` shows the bill and `--filter default` does not, the +fix is to add a substantive log entry, not to change the filter. + +### The `--select docs` projection + +`--select docs` collapses each surviving entry to the +`{"id","text","kind":"docs"}` document the stream protocol defines +(`schemas/STREAM_PROTOCOL.md` §1) — the record `fastclass classify -` +consumes. The default `--select default` keeps the full joined record +for `govbot publish` and ad-hoc analysis. ## DuckDB Integration @@ -103,7 +185,7 @@ The `govbot load` command loads bill metadata into a DuckDB database for SQL ana **Usage**: ```bash -govbot clone all # First, get the data +govbot pull all # First, get the data govbot load # Load into DuckDB govbot load --memory-limit 32GB # For large datasets duckdb --ui ~/.govbot/govbot.duckdb # Open in browser UI @@ -111,12 +193,103 @@ duckdb --ui ~/.govbot/govbot.duckdb # Open in browser UI See `actions/govbot/DUCKDB.md` for query examples and schema documentation. 
+## Classifying with fastclass + +Classification is a **pipe** of two tools that compose over a +process boundary — govbot streams the data, **fastclass** (a standalone, +self-improving text classifier) classifies it, govbot persists the result: + +```bash +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +``` + +- **`govbot source --select docs`** emits one `{"id","text","kind":"docs"}` + document per bill carrying the **full bill text** from `metadata.json`; the + `id` is the bill's dataset path, which routes the result back. +- **`fastclass classify -`** scores each document against a **classifier + bundle** — a fastclass-native directory (`classifier.yml` + `fusion.yml` + + `eval/`). govbot passes only the bundle path; it never reads the bundle. +- **`govbot apply`** reads fastclass's result JSON from stdin and writes per-tag + `.tag.json` files under `/tags//country:.../sessions//` + — the files `govbot publish` turns into feeds. It classifies nothing + itself; it is purely the persistence sink. + +### Project layout — `tags/` vs `.govbot/` vs `dist/` + +A govbot project has four top-level tool-managed dirs, each with a +distinct role; do not conflate them: + +- **`.govbot/`** — the tool's **cache**, the `node_modules/` equivalent. + Cloned datasets, content-addressed sync state, an optional registry + override. Fully regenerable; safe to `rm -rf` to start fresh. + **Never edited by hand, never written to by `apply`.** It does NOT + hold user-meaningful state — the bluesky publisher's posted-state + ledger lives under `state/`, not here. +- **`tags/`** — **classification output**, written by `govbot apply`. The + layout mirrors the source path with a dataset prefix: + `tags//country:.../state:.../sessions//.tag.json`. + Regenerated by every classify run; the dataset prefix is what isolates + same-named tag files across jurisdictions in a multi-dataset project.
+- **`state/`** — **publisher state**, written by `govbot publish`. The + bluesky publisher's posted-state ledger lives at + `state/bluesky-.ledger`. Regenerable-but-operational: deleting + it makes the next run double-post. Peer of `tags/` and `dist/`. +- **`dist/`** — **publisher output**, written by `govbot publish` (RSS / + HTML / JSON feeds). + +### Publishers dedup by bill, not by action log + +Every publisher (`bluesky`, `rss`, `html`, `json`) emits **one item per +(jurisdiction, bill_id)** — not one per action-log file. A single bill +typically emits many records to the source stream (one per committee +referral, hearing, vote event, …); the publishers collapse them to a +single representative so an activist sees one post per bill in their +feed. Before this fix, the climate-tracker feed posted NV AB1 six +times under six different action logs. + +The dedup key is the **bill-level GUID** (`rss::bill_guid`), of the form +`/.../sessions//bills/`. For `bluesky` the +**ledger key** is also this bill-level GUID: future log additions for an +already-posted bill do **not** trigger a re-post. The publisher reads +legacy per-log ledger entries on upgrade — entries written under the +per-bill-log layout collapse to the new bill key cleanly; entries +written under the session-level-log layout (the OCD-files common case) +incur a one-time re-post per previously-posted bill, after which the +ledger holds the new bill-level GUID and the bill never re-posts again. + +**`govbot.yml` is NOT the classifier — it is a manifest.** It declares +`datasets`, `transforms`, `publish`, and `pipelines`; it has **no `tags:` +block**. The tag taxonomy lives in a separate **fastclass classifier bundle** +that the manifest's `transforms..classifier` field references by path. 
+The two configs change at different cadences and are read by different tools: +`govbot.yml` answers *"what data, what transforms, what publishers"*; the +classifier bundle's `classifier.yml` answers *"what's relevant"*. + +To run the self-improving loop, work inside the classifier bundle directory and +use the fastclass Claude Code plugin (`/fastclass:improve`, `/fastclass:ratify`) +and the fastclass `compile evaluate` / `compile backtest` / `compile ratify` primitives. The +retired `fastclass --propose` flag no longer exists. For activists who have +ratified one proposal end-to-end, `/fastclass:improve autonomous` becomes the +ongoing default — constitution-passing proposals apply as usual, coverage-gap +proposals re-test against the rolling eval set and land only if rolling proves +them safe (`generated_by: autonomous-coverage-gap` in `fastclass.lock`). +AGENT.md §2 carries the activist-facing framing. + +**Prerequisite**: the `fastclass` binary must be resolvable on `PATH`, +`~/.cargo/bin`, or `~/.govbot/bin` (`cargo install --path `). +`govbot run`'s transform stage resolves transform binaries the same way. + +To improve tag quality, read **`AGENTS.md` in the fastclass repo** — the +operational playbook for the classify → eval → propose → backtest → promote +loop. Its one hard rule: never show the frozen `eval/constitution.yml` gold set +to an LLM. 
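The bill-level dedup and ledger behavior described in this section can be sketched as follows. This is a minimal Python illustration, not govbot's Rust implementation: the record fields (`jurisdiction`, `bill_id`, `date`, `action`), the helper names, the in-memory `set` standing in for the ledger file, and the "keep the most recent log" rule for picking the representative are all assumptions made for the example, and the sample records are invented.

```python
# Hypothetical sketch of "one item per (jurisdiction, bill_id)" plus a
# posted-state ledger. Field and function names are assumptions, not govbot's API.

def bill_key(record):
    """Bill-level key: every action log of one bill maps to the same key."""
    return (record["jurisdiction"], record["bill_id"])


def dedup_by_bill(records):
    """Collapse many action-log records into one representative per bill.

    Assumption: the representative is the record with the latest ISO date.
    """
    latest = {}
    for rec in records:
        key = bill_key(rec)
        if key not in latest or rec["date"] > latest[key]["date"]:
            latest[key] = rec
    return list(latest.values())


def unposted(items, ledger):
    """Return items whose bill key is not in the ledger, then record them."""
    fresh = [item for item in items if bill_key(item) not in ledger]
    ledger.update(bill_key(item) for item in fresh)
    return fresh


# Invented sample stream: two logs for one Nevada bill, one log for a Wyoming bill.
records = [
    {"jurisdiction": "nv", "bill_id": "AB1", "date": "2025-02-03", "action": "Committee report"},
    {"jurisdiction": "nv", "bill_id": "AB1", "date": "2025-03-10", "action": "Passed Assembly"},
    {"jurisdiction": "wy", "bill_id": "HB0001", "date": "2025-01-15", "action": "Passed House"},
]

ledger = set()
first_run = unposted(dedup_by_bill(records), ledger)

# A later log for an already-posted bill must not trigger a re-post.
later_log = {"jurisdiction": "nv", "bill_id": "AB1", "date": "2025-04-01", "action": "Signed"}
second_run = unposted(dedup_by_bill(records + [later_log]), ledger)

print(len(first_run), len(second_run))
```

Keying both the dedup step and the ledger on the same bill-level value is what keeps a bill from re-posting when new action logs arrive; a per-log key would treat each new log as a new item.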
+ ## Testing with Mock Data Mock legislative data is available for offline development: - Location: `actions/govbot/mocks/.govbot/repos/` - Contains: Wyoming (wy) and Guam (gu) sample data -- Usage: `govbot logs --govbot-dir ./actions/govbot/mocks/.govbot` +- Usage: `govbot source --govbot-dir ./actions/govbot/mocks/.govbot` ## govbot Development @@ -125,7 +298,7 @@ cd actions/govbot just setup # Install Rust toolchain and dependencies just test # Run snapshot tests just review # Review snapshot changes (insta) -just govbot logs # Run CLI in dev mode (uses mocks/.govbot) +just govbot source # Run CLI in dev mode (uses mocks/.govbot) just mocks wy il # Update mock data for testing ``` diff --git a/README.md b/README.md index 82b1a27f..e1509bdc 100644 --- a/README.md +++ b/README.md @@ -5,10 +5,98 @@ # 🏛️ govbot -- Download the legislation of [47 states/jurisdicitions](github.com/govbot-data) in under 1 minute. -- Tag/summarize bills with with private/local models optimized to run on free Github Actions. - -`govbot` enables distributed data anaylsis of government updates via a friendly terminal interface. Git repos function as datasets, including the legislation of all 47 states/jurisdictions. +**govbot is a 4-tool stack for civic-data publishing** — pull real legislative +data, filter by what you care about, publish with receipts, all from a +coding-agent-native dev experience. The whole stack is designed to run on +free GitHub Actions and a local laptop with local models, so a small +volunteer crew can stand up a credible bot and **keep it running for +~nothing**. + +The first user is **climate-activist**, a userland repo that turns the +country's legislative activity into a Bluesky feed worth reading at +nearly-free cost. Everything in this README is in service of that bar: if +climate-activist cannot ship a "worth reading, nearly free to run" post, +govbot has not earned the framing. + +### The 4 tools + +1. 
**Select real gov data** — pull the legislative activity of all 50 + states, DC, the territories, and federal Congress from a registry of git + repos (`govbot pull`, scrapers thanks to [OpenStates](https://openstates.org)). + Repos are content-addressed; a second pull (here or in another project) + is a cache hit, not a re-clone. `govbot doctor` validates the cache. + *Today:* bill text + subjects ship via `govbot source --select docs`. + *Honest gap:* sponsors and voting records exist in the underlying + metadata but are not yet in the `--select docs` projection; the "under + 1 minute" headline is the warm-cache case (a cold clone of all 55 + datasets is closer to 3 minutes). + +2. **Filter / transform** (map / filter / reduce — *find the relevant + bills*) — any transform over the stream. The shipped transform today is + **fastclass tagging**: a low-token, high-quality text classifier that + tags bills against an issue taxonomy the user owns, then filters to + what crosses a confidence threshold. *Honest gap:* the planned + **`summarize` transform** — a local-LLM digest of 1–n grouped bills + that emits the summary alongside its data-source trace and model + identity — is not yet built. Userland holds a `summarizer/prompt.md` + stub; the code does not exist. + +3. **Publish with receipts** — many surfaces (RSS, HTML, JSON, DuckDB, + Bluesky today; X planned). The defining idea: every AI-generated + digest links back to **deterministic provenance** — the model used, + the source data, the fastclass reasoning chain, and the recipe to + regenerate it — published as a GitHub Pages "receipt" page next to the + short Bluesky post. *Honest gap:* the AI digest publisher and the + receipt artifact are not yet built. Today's publishers carry + classification evidence chains internally but do not yet package them + into a public, auditable receipt page. + +4. 
**Coding-agent-native dev experience** — `AGENT.md` is a self-contained + playbook a fresh Claude Code session can follow to **make, manage, and + update** a govbot project. The fastclass plugin + (`/fastclass:from-intent`, `/fastclass:improve`, `/fastclass:ratify`, + `/fastclass:install-model`) handles the classifier loop end-to-end. + `govbot doctor` validates an installation. The "build your own + high-quality, low-cost govbot" path is the one tool that is already + working today. + +### Roadmap (honest gap map) + +Things named in the vision that **do not exist yet**, in priority order: + +- **Sponsors + voting records in `--select docs`.** The underlying scrapers + capture them; the source projection does not yet expose them to + classifiers and digesters. Closes a known recall gap on + sponsor-pattern signals. +- **The `summarize` transform.** A local-LLM digest of grouped bills that + emits the summary plus a structured trace (model id, source bill ids, + prompt revision) so the digest is reproducible. +- **Receipts.** A GitHub Pages artifact published alongside every AI + digest post: human-readable on top, full deterministic provenance + (source bills, model, fastclass scores + reasoning, regen command) + underneath. The short post links to the receipt; the receipt is the + source of trust. +- **X publisher.** Same idempotent posting pattern as the Bluesky + publisher. +- **The "under 1 minute" cold-pull headline.** Today's cold pull of all 55 + datasets is ~3 min. Caching and partial-clone improvements get it + closer to the headline. + +These are tracked as gaps so the rest of the document can be specific +about what *does* work today. + +## 🤖 Build a newsbot with Claude Code + +The fastest way to stand up a govbot project — a classified, auto-publishing +legislation feed (e.g. a Bluesky bot) — is to let Claude Code drive it. 
+Open a Claude Code session in an empty directory and paste: + +> **Read github.com/chihacknight/govbot/AGENT.md and follow it to set up a govbot project here.** + +[`AGENT.md`](AGENT.md) is a self-contained playbook: Claude verifies the +tools, interviews you about the issue you want to track, scaffolds the +`govbot.yml` manifest + a `fastclass` classifier bundle, and walks you through +running and scheduling the bot. No plugin or marketplace install needed. ## Example Projects @@ -26,50 +114,118 @@ sh -c "$(curl -fsSL https://raw.githubusercontent.com/chihacknight/govbot/main/a ### 2. Set up your project ```bash -govbot +govbot init # or just `govbot` — the wizard runs when no govbot.yml is present ``` -Running `govbot` with no config file launches an interactive setup wizard that: -1. Asks what data sources you want (all 47 states or specific ones) -2. Guides you through creating tags for topics you care about -3. Creates `govbot.yml`, `.gitignore`, and a GitHub Actions workflow +Running `govbot init` (or `govbot` in an empty directory) launches an interactive setup wizard that: +1. Asks which datasets you want — all jurisdictions or a hand-picked subset (browse with `govbot search`). +2. Writes a `govbot.yml` manifest (`datasets` / `transforms` / `publish` / `pipelines`), a `.gitignore`, and a GitHub Actions workflow. + +Classification lives in a separate [`fastclass`](#classifying-with-fastclass) bundle — point `transforms.classify.classifier` at it. ### 3. Run the pipeline ```bash -govbot +govbot run --dry-run # render-only: every publisher previews its output +govbot run # or just `govbot` — runs the pipeline when a govbot.yml is present ``` -With a `govbot.yml` in your directory, running `govbot` executes the full pipeline: -1. Clones/updates legislation repositories -2. Tags bills based on your tag definitions -3. Generates RSS feeds in the `docs/` directory +With a `govbot.yml` in your directory, `govbot run` executes the full pipeline: +1. 
Pulls/updates the declared dataset repositories. +2. Classifies bills against your fastclass bundle (`source --select docs | fastclass classify - | apply`). +3. Runs every publisher in `govbot.yml: publish:` — RSS / HTML / JSON / DuckDB / Bluesky. + +`govbot run --dry-run` propagates `--dry-run` to every publisher — the +`bluesky` publisher renders posts to stderr/stdout and touches no network or +ledger. Without `--dry-run`, a `bluesky` publisher whose `BLUESKY_HANDLE` / +`BLUESKY_APP_PASSWORD` env vars are not set is **skipped with a `WARN`** +rather than failing the pipeline — first-time runs without creds still emit +the RSS / HTML feeds. ### Other Commands ```bash -govbot clone all # download all state legislation datasets -govbot clone il ca ny # download specific states -govbot logs # stream legislative activity as JSON Lines -govbot logs | govbot tag # process and tag data -govbot build # generate RSS feeds +govbot search wyoming # search the dataset registry +govbot add wy il # add datasets to govbot.yml (validated against the registry) +govbot remove wy # remove datasets from govbot.yml +govbot ls # list the manifest's datasets + what is cached locally +govbot pull all # clone/update every dataset +govbot pull il ca ny # clone/update specific datasets +govbot source # stream dataset records as JSON Lines +govbot logs # deprecated alias for `govbot source` (default mode), kept for back-compat with the CHN-Bluesky-Govbot-Main framework's `govbot logs > bills.jsonl` +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +govbot apply # persist a fastclass result stream under /tags/ +govbot publish # run every configured publisher (RSS / HTML / JSON / DuckDB / Bluesky) +govbot publish --publisher bluesky --dry-run # ALWAYS dry-run Bluesky first +govbot run --dry-run # full pipeline, every publisher dry-run (recommended first run) +govbot run # the full pipeline: pull -> classify -> apply -> publish govbot load # load bill 
metadata into DuckDB -govbot delete all # remove all downloaded data -govbot update # update govbot to latest version +govbot delete all # unlink all locally-linked datasets (the shared cache stays) +govbot update # update govbot to the latest nightly govbot --help # see all commands and options ``` -# 🏛️ Govbot Legislation Data Catalogs +## Classifying with fastclass + +govbot does not classify bills itself — it streams them to a separate +[`fastclass`](#) CLI (a token-free, deterministic text classifier) and writes +the result back. The pipe: + +```bash +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +``` + +`govbot run` wires this automatically. The classifier is a **bundle directory** +(`classifier.yml` + `fusion.yml` + `eval/`) owned by fastclass; govbot only +references its path. See [`AGENT.md`](AGENT.md) for the end-to-end newsbot +playbook (make / manage / update) and the [stream protocol](schemas/STREAM_PROTOCOL.md) +for the wire format. -See the data catalogs [here](github.com/govbot-data). +### Project layout -- Nearly all state governments -- Federal +A govbot project has four tool-managed directories, each with a distinct +role; all are git-ignored by default: -WIP: Ideally, these scripts should be accessible via the following ways. +| Dir | Owner | Contents | +|---|---|---| +| `.govbot/` | the tool's **cache** (`node_modules/` equivalent) | cloned datasets, sync state. Fully regenerable. Never edit. | +| `tags/` | `govbot apply` (**classification output**) | `tags//country:.../sessions//.tag.json` | +| `state/` | `govbot publish` (**publisher state**) | append-only ledgers (e.g. bluesky's posted-state at `state/bluesky-.ledger`). Regenerable but operational — deleting it double-posts on the next run. | +| `dist/` (or `docs/`) | `govbot publish` (**publisher output**) | RSS / HTML / JSON feeds | + +Remove `tags/` from `.gitignore` to commit classification provenance.
+Remove `state/` from `.gitignore` to commit publisher state (e.g. so a +cold CI clone resumes without double-posting). + +# 🏛️ Govbot Legislation Data Catalogs -- CLI / Unix pipe friendliness where possible. CLI is the most portable of solutions. -- GitHub Actionable if possible +govbot pulls data from a registry of git-repo datasets. The bundled default +registry (`actions/govbot/data/registry.json`) ships every US state, DC, the +territories, and federal Congress — see [`actions/govbot/REGISTRY.md`](actions/govbot/REGISTRY.md) +for the format, and the [govbot-data org](https://github.com/govbot-data) for +the dataset repos themselves. + +Coverage today: +- Every US state legislature +- DC and the US territories (PR, GU, VI, MP) +- US federal (Congress) + +Override the registry with `GOVBOT_REGISTRY_URL=` or a project-local +`.govbot/registry.json`. + +## Lineage + +Govbot's civic-tech application — feeding state legislative data into +per-topic Bluesky bots — was first proven by Frankie Vegliante's +**CHN-Bluesky-Govbot-Main** framework +([github.com/frankies2727/CHN-Bluesky-Govbot-Main](https://github.com/frankies2727/CHN-Bluesky-Govbot-Main)). +That framework's design — per-topic configs, GitHub Actions cron, per-topic +state ledger, and a shared posting pipeline across 13 issue-area Bluesky +bots (transportation, housing, education, immigration, …) — is the pattern +that govbot's 4-tool architecture and the climate-activist deployment both +build on. Govbot's planned `govbot init --from-frankie-config` flag (Phase +1b, in flight) is intended to let a CHN topic migrate to this stack with its keywords, +emoji map, and posted-state history intact.
## Contribute diff --git a/actions/govbot/Cargo.lock b/actions/govbot/Cargo.lock index e6b5fc7f..62a3a2cc 100644 --- a/actions/govbot/Cargo.lock +++ b/actions/govbot/Cargo.lock @@ -123,18 +123,6 @@ version = "1.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c08606f8c3cbf4ce6ec8e28fb0014a2c086708fe954eaa885384a6165172e7e8" -[[package]] -name = "base64" -version = "0.13.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9e1b586273c5702936fe7b7d6896644d8be71e6314cfe09d3167c95f712589e8" - -[[package]] -name = "base64" -version = "0.21.7" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9d297deb1925b89f2ccc13d7635fa0714f12c87adce1c75356b39ca9b7178567" - [[package]] name = "base64" version = "0.22.1" @@ -143,15 +131,9 @@ checksum = "72b3254f16251a8381aa12e40e3c4d2f0199f8c6508fbecb9d91f575e0fbb8c6" [[package]] name = "base64ct" -version = "1.8.1" +version = "1.8.3" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0e050f626429857a27ddccb31e0aca21356bfa709c04041aefddac081a8f068a" - -[[package]] -name = "bitflags" -version = "1.3.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bef38d45163c2f1dde094a7dfd33ccf595c92905c8f8f4fdc18d06fb1037718a" +checksum = "2af50177e190e07a26ab74f8b1efbfe2ef87da2116221318cb1c2e82baf7de06" [[package]] name = "bitflags" @@ -174,12 +156,6 @@ version = "3.19.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "46c5e41b57b8bba42a04676d81cb89e9ee8e859a1a66f80a5a72e1cb76b34d43" -[[package]] -name = "byteorder" -version = "1.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1fd0f2584146f6f2ef48085050886acf353beff7305ebd1ae69500e27c67f64b" - [[package]] name = "bytes" version = "1.11.0" @@ -277,6 +253,35 @@ dependencies = [ "windows-sys 0.59.0", ] +[[package]] +name = "cookie" +version = "0.18.1" +source = 
"registry+https://github.com/rust-lang/crates.io-index" +checksum = "4ddef33a339a91ea89fb53151bd0a4689cfce27055c291dfa69945475d22c747" +dependencies = [ + "percent-encoding", + "time", + "version_check", +] + +[[package]] +name = "cookie_store" +version = "0.22.1" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "15b2c103cf610ec6cae3da84a766285b42fd16aad564758459e6ecf128c75206" +dependencies = [ + "cookie", + "document-features", + "idna", + "indexmap", + "log", + "serde", + "serde_derive", + "serde_json", + "time", + "url", +] + [[package]] name = "core-foundation" version = "0.9.4" @@ -414,14 +419,23 @@ dependencies = [ [[package]] name = "der" -version = "0.7.10" +version = "0.8.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e7c1832837b905bbfb5101e07cc24c8deddf52f93225eee6ead5f4d63d53ddcb" +checksum = "71fd89660b2dc699704064e59e9dba0147b903e85319429e131620d022be411b" dependencies = [ "pem-rfc7468", "zeroize", ] +[[package]] +name = "deranged" +version = "0.5.8" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "7cd812cc2bc1d69d4764bd80df88b4317eaef9e773c75226407d9bc0876b211c" +dependencies = [ + "powerfmt", +] + [[package]] name = "derive_builder" version = "0.20.2" @@ -496,6 +510,15 @@ dependencies = [ "syn", ] +[[package]] +name = "document-features" +version = "0.2.12" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d4b8a88685455ed29a21542a33abd9cb6510b6b129abadabdcef0f4c55bc8f61" +dependencies = [ + "litrs", +] + [[package]] name = "either" version = "1.15.0" @@ -533,33 +556,12 @@ dependencies = [ "windows-sys 0.61.2", ] -[[package]] -name = "esaxx-rs" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d817e038c30374a4bcb22f94d0a8a0e216958d4c3dcde369b1439fec4bdda6e6" -dependencies = [ - "cc", -] - [[package]] name = "fastrand" version = "2.3.0" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "37909eebbb50d72f9059c3b6d82c0463f2ff062c9e95845c43a6c9c0355411be" -[[package]] -name = "filetime" -version = "0.2.26" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bc0505cd1b6fa6580283f6bdf70a73fcf4aba1184038c90902b92b3dd0df63ed" -dependencies = [ - "cfg-if", - "libc", - "libredox", - "windows-sys 0.60.2", -] - [[package]] name = "find-msvc-tools" version = "0.1.5" @@ -705,17 +707,6 @@ dependencies = [ "version_check", ] -[[package]] -name = "getrandom" -version = "0.2.16" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "335ff9f135e4384c8150d6f27c6daed433577f86b4750418338c01a1a2528592" -dependencies = [ - "cfg-if", - "libc", - "wasi", -] - [[package]] name = "getrandom" version = "0.3.4" @@ -734,7 +725,7 @@ version = "0.18.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "232e6a7bfe35766bf715e55a88b39a700596c0ccfd88cd3680b4cdb40d66ef70" dependencies = [ - "bitflags 2.10.0", + "bitflags", "libc", "libgit2-sys", "log", @@ -756,11 +747,8 @@ dependencies = [ "git2", "insta", "jwalk", - "ndarray 0.15.6", - "ort", "pathdiff", "regex", - "reqwest", "rss", "serde", "serde_json", @@ -768,29 +756,9 @@ dependencies = [ "sha2", "tempfile", "thiserror", - "tokenizers", "tokio", "tokio-test", - "toml", -] - -[[package]] -name = "h2" -version = "0.3.27" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0beca50380b1fc32983fc1cb4587bfa4bb9e78fc259aad4a0032d2080309222d" -dependencies = [ - "bytes", - "fnv", - "futures-core", - "futures-sink", - "futures-util", - "http 0.2.12", - "indexmap", - "slab", - "tokio", - "tokio-util", - "tracing", + "ureq", ] [[package]] @@ -805,17 +773,6 @@ version = "0.5.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2304e00983f87ffb38b55b444b5e3b60a884b5d30c0fca7d82fe33449bbe55ea" -[[package]] -name = "http" -version = "0.2.12" -source 
= "registry+https://github.com/rust-lang/crates.io-index" -checksum = "601cbb57e577e2f5ef5be8e7b83f0f63994f25aa94d673e54a92d5c516d101f1" -dependencies = [ - "bytes", - "fnv", - "itoa", -] - [[package]] name = "http" version = "1.4.0" @@ -826,66 +783,12 @@ dependencies = [ "itoa", ] -[[package]] -name = "http-body" -version = "0.4.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7ceab25649e9960c0311ea418d17bee82c0dcec1bd053b5f9a66e265a693bed2" -dependencies = [ - "bytes", - "http 0.2.12", - "pin-project-lite", -] - [[package]] name = "httparse" version = "1.10.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6dbf3de79e51f3d586ab4cb9d5c3e2c14aa28ed23d180cf89b4df0454a69cc87" -[[package]] -name = "httpdate" -version = "1.0.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "df3b46402a9d5adb4c86a0cf463f42e19994e3ee891101b1841f30a545cb49a9" - -[[package]] -name = "hyper" -version = "0.14.32" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "41dfc780fdec9373c01bae43289ea34c972e40ee3c9f6b3c8801a35f35586ce7" -dependencies = [ - "bytes", - "futures-channel", - "futures-core", - "futures-util", - "h2", - "http 0.2.12", - "http-body", - "httparse", - "httpdate", - "itoa", - "pin-project-lite", - "socket2 0.5.10", - "tokio", - "tower-service", - "tracing", - "want", -] - -[[package]] -name = "hyper-tls" -version = "0.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d6183ddfa99b85da61a140bea0efc93fdf56ceaa041b37d553518030827f9905" -dependencies = [ - "bytes", - "hyper", - "native-tls", - "tokio", - "tokio-native-tls", -] - [[package]] name = "iana-time-zone" version = "0.1.64" @@ -946,7 +849,7 @@ dependencies = [ "icu_normalizer_data", "icu_properties", "icu_provider", - "smallvec 1.15.1", + "smallvec", "zerovec", ] @@ -1004,7 +907,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"3b0875f23caa03898994f6ddc501886a45c7d3d62d04d2d90788d47be1b1e4de" dependencies = [ "idna_adapter", - "smallvec 1.15.1", + "smallvec", "utf8_iter", ] @@ -1028,19 +931,6 @@ dependencies = [ "hashbrown", ] -[[package]] -name = "indicatif" -version = "0.17.11" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "183b3088984b400f4cfac3620d5e076c84da5364016b4f49473de574b2586235" -dependencies = [ - "console", - "number_prefix", - "portable-atomic", - "unicode-width", - "web-time", -] - [[package]] name = "insta" version = "1.44.3" @@ -1053,36 +943,12 @@ dependencies = [ "similar", ] -[[package]] -name = "ipnet" -version = "2.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "469fb0b9cefa57e3ef31275ee7cacb78f2fdca44e4765491884a2b119d4eb130" - [[package]] name = "is_terminal_polyfill" version = "1.70.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "a6cb138bb79a146c1bd460005623e142ef0181e3d0219cb493e02f7d08a35695" -[[package]] -name = "itertools" -version = "0.11.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b1c173a5686ce8bfa551b3563d0c2170bf24ca44da99c7ca4bfdab5418c3fe57" -dependencies = [ - "either", -] - -[[package]] -name = "itertools" -version = "0.12.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba291022dbbd398a455acf126c1e341954079855bc60dfdda641363bd6922569" -dependencies = [ - "either", -] - [[package]] name = "itoa" version = "1.0.15" @@ -1095,7 +961,7 @@ version = "0.1.34" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9afb3de4395d6b3e67a780b6de64b51c978ecf11cb9a462c66be7d4ca9039d33" dependencies = [ - "getrandom 0.3.4", + "getrandom", "libc", ] @@ -1119,12 +985,6 @@ dependencies = [ "rayon", ] -[[package]] -name = "lazy_static" -version = "1.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = 
"bbd2bcb4c963f2ddae06a2efc7e9f3591312473c50c6685e1f298068316e66fe" - [[package]] name = "libc" version = "0.2.178" @@ -1145,17 +1005,6 @@ dependencies = [ "pkg-config", ] -[[package]] -name = "libredox" -version = "0.1.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "416f7e718bdb06000964960ffa43b4335ad4012ae8b99060261aa4a8088d5ccb" -dependencies = [ - "bitflags 2.10.0", - "libc", - "redox_syscall", -] - [[package]] name = "libssh2-sys" version = "0.3.1" @@ -1195,36 +1044,16 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "6373607a59f0be73a39b6fe456b8192fcc3585f602af20751600e974dd455e77" [[package]] -name = "log" -version = "0.4.29" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897" - -[[package]] -name = "macro_rules_attribute" -version = "0.2.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "65049d7923698040cd0b1ddcced9b0eb14dd22c5f86ae59c3740eab64a676520" -dependencies = [ - "macro_rules_attribute-proc_macro", - "paste", -] - -[[package]] -name = "macro_rules_attribute-proc_macro" -version = "0.2.2" +name = "litrs" +version = "1.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "670fdfda89751bc4a84ac13eaa63e205cf0fd22b4c9a5fbfa085b63c1f1d3a30" +checksum = "11d3d7f243d5c5a8b9bb5d6dd2b1602c0cb0b9db1621bafc7ed66e35ff9fe092" [[package]] -name = "matrixmultiply" -version = "0.3.10" +name = "log" +version = "0.4.29" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a06de3016e9fae57a36fd14dba131fccf49f74b40b7fbdb472f96e361ec71a08" -dependencies = [ - "autocfg", - "rawpointer", -] +checksum = "5e5032e24019045c762d3c0f28f5b6b8bbf38563a65908389bf7978758920897" [[package]] name = "memchr" @@ -1232,18 +1061,6 @@ version = "2.7.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"f52b00d39961fc5b2736ea853c9cc86238e165017a493d1d5c8eac6bdc4cc273" -[[package]] -name = "mime" -version = "0.3.17" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "6877bb514081ee2a7ff5ef9de3281f14a4dd4bceac4c09388074a6b5df8a139a" - -[[package]] -name = "minimal-lexical" -version = "0.2.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "68354c5c6bd36d73ff3feceb05efa59b6acb7626617f4962be322a825e61f79a" - [[package]] name = "miniz_oxide" version = "0.8.9" @@ -1254,39 +1071,6 @@ dependencies = [ "simd-adler32", ] -[[package]] -name = "mio" -version = "1.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a69bcab0ad47271a0234d9422b131806bf3968021e5dc9328caf2d4cd58557fc" -dependencies = [ - "libc", - "wasi", - "windows-sys 0.61.2", -] - -[[package]] -name = "monostate" -version = "0.1.18" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "3341a273f6c9d5bef1908f17b7267bbab0e95c9bf69a0d4dcf8e9e1b2c76ef67" -dependencies = [ - "monostate-impl", - "serde", - "serde_core", -] - -[[package]] -name = "monostate-impl" -version = "0.1.18" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e4db6d5580af57bf992f59068d4ea26fd518574ff48d7639b255a36f9de6e7e9" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - [[package]] name = "native-tls" version = "0.2.14" @@ -1304,34 +1088,6 @@ dependencies = [ "tempfile", ] -[[package]] -name = "ndarray" -version = "0.15.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "adb12d4e967ec485a5f71c6311fe28158e9d6f4bc4a447b474184d0f91a8fa32" -dependencies = [ - "matrixmultiply", - "num-complex", - "num-integer", - "num-traits", - "rawpointer", -] - -[[package]] -name = "ndarray" -version = "0.16.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "882ed72dce9365842bf196bdeedf5055305f11fc8c03dee7bb0194a6cad34841" -dependencies = [ - 
"matrixmultiply", - "num-complex", - "num-integer", - "num-traits", - "portable-atomic", - "portable-atomic-util", - "rawpointer", -] - [[package]] name = "never" version = "0.1.0" @@ -1339,32 +1095,10 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "c96aba5aa877601bb3f6dd6a63a969e1f82e60646e81e71b14496995e9853c91" [[package]] -name = "nom" -version = "7.1.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d273983c5a657a70a3e8f2a01329822f3b8c8172b73826411a55751e404a0a4a" -dependencies = [ - "memchr", - "minimal-lexical", -] - -[[package]] -name = "num-complex" -version = "0.4.6" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "73f88a1307638156682bada9d7604135552957b7818057dcef22705b4d509495" -dependencies = [ - "num-traits", -] - -[[package]] -name = "num-integer" -version = "0.1.46" +name = "num-conv" +version = "0.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7969661fd2958a5cb096e56c8e1ad0444ac2bbcd0061bd28660485a44879858f" -dependencies = [ - "num-traits", -] +checksum = "521739c6d2bac4aa25192232afe6841231376b2b26d4d9fae5ecf8ca5772e441" [[package]] name = "num-traits" @@ -1375,12 +1109,6 @@ dependencies = [ "autocfg", ] -[[package]] -name = "number_prefix" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "830b246a0e5f20af87141b25c173cd1b609bd7779a4617d6ec582abaf90870f3" - [[package]] name = "once_cell" version = "1.21.3" @@ -1393,35 +1121,13 @@ version = "1.70.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "384b8ab6d37215f3c5301a95a4accb5d64aa607f1fcb26a11b5303878451b4fe" -[[package]] -name = "onig" -version = "6.5.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "336b9c63443aceef14bea841b899035ae3abe89b7c486aaf4c5bd8aafedac3f0" -dependencies = [ - "bitflags 2.10.0", - "libc", - "once_cell", - "onig_sys", -] - -[[package]] 
-name = "onig_sys" -version = "69.9.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c7f86c6eef3d6df15f23bcfb6af487cbd2fed4e5581d58d5bf1f5f8b7f6727dc" -dependencies = [ - "cc", - "pkg-config", -] - [[package]] name = "openssl" version = "0.10.75" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "08838db121398ad17ab8531ce9de97b244589089e290a384c900cb9ff7434328" dependencies = [ - "bitflags 2.10.0", + "bitflags", "cfg-if", "foreign-types", "libc", @@ -1460,47 +1166,16 @@ dependencies = [ ] [[package]] -name = "ort" -version = "2.0.0-rc.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1fa7e49bd669d32d7bc2a15ec540a527e7764aec722a45467814005725bcd721" -dependencies = [ - "ndarray 0.16.1", - "ort-sys", - "smallvec 2.0.0-alpha.10", - "tracing", -] - -[[package]] -name = "ort-sys" -version = "2.0.0-rc.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e2aba9f5c7c479925205799216e7e5d07cc1d4fa76ea8058c60a9a30f6a4e890" -dependencies = [ - "flate2", - "pkg-config", - "sha2", - "tar", - "ureq", -] - -[[package]] -name = "paste" -version = "1.0.15" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "57c0d7b74b563b49d38dae00a0c37d4d6de9b432382b2892f0574ddcae73fd0a" - -[[package]] -name = "pathdiff" -version = "0.2.3" +name = "pathdiff" +version = "0.2.3" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "df94ce210e5bc13cb6651479fa48d14f601d9858cfe0467f43ae157023b938d3" [[package]] name = "pem-rfc7468" -version = "0.7.0" +version = "1.0.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "88b39c9bfcfc231068454382784bb460aae594343fb030d46e9f50a645418412" +checksum = "a6305423e0e7738146434843d1694d621cce767262b2a86910beab705e4493d9" dependencies = [ "base64ct", ] @@ -1529,21 +1204,6 @@ version = "0.3.32" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"7edddbd0b52d732b21ad9a5fab5c704c14cd949e5e9a1ec5929a24fded1b904c" -[[package]] -name = "portable-atomic" -version = "1.11.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f84267b20a16ea918e43c6a88433c2d54fa145c92a811b5b047ccbe153674483" - -[[package]] -name = "portable-atomic-util" -version = "0.2.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d8a2f0d8d040d7848a709caf78912debcc3f33ee4b3cac47d73d1e1069e83507" -dependencies = [ - "portable-atomic", -] - [[package]] name = "potential_utf" version = "0.1.4" @@ -1554,13 +1214,10 @@ dependencies = [ ] [[package]] -name = "ppv-lite86" -version = "0.2.21" +name = "powerfmt" +version = "0.2.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "85eae3c4ed2f50dcfe72643da4befc30deadb458a9b590d720cde2f2b1e97da9" -dependencies = [ - "zerocopy", -] +checksum = "439ee305def115ba05938db6eb1644ff94165c5ab5e9420d1c1bcedbba909391" [[package]] name = "proc-macro2" @@ -1596,42 +1253,6 @@ version = "5.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "69cdb34c158ceb288df11e18b4bd39de994f6657d83847bdffdbd7f346754b0f" -[[package]] -name = "rand" -version = "0.8.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404" -dependencies = [ - "libc", - "rand_chacha", - "rand_core", -] - -[[package]] -name = "rand_chacha" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88" -dependencies = [ - "ppv-lite86", - "rand_core", -] - -[[package]] -name = "rand_core" -version = "0.6.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ec0be4795e2f6a28069bec0b5ff3e2ac9bafc99e6a9a7dc3547996c5c816922c" -dependencies = [ - "getrandom 0.2.16", -] - -[[package]] -name = "rawpointer" -version = "0.2.1" -source 
= "registry+https://github.com/rust-lang/crates.io-index" -checksum = "60a357793950651c4ed0f3f52338f53b2f809f32d83a07f72909fa13e4c6c1e3" - [[package]] name = "rayon" version = "1.11.0" @@ -1642,17 +1263,6 @@ dependencies = [ "rayon-core", ] -[[package]] -name = "rayon-cond" -version = "0.3.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "059f538b55efd2309c9794130bc149c6a553db90e9d99c2030785c82f0bd7df9" -dependencies = [ - "either", - "itertools 0.11.0", - "rayon", -] - [[package]] name = "rayon-core" version = "1.13.0" @@ -1663,15 +1273,6 @@ dependencies = [ "crossbeam-utils", ] -[[package]] -name = "redox_syscall" -version = "0.5.18" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ed2bf2547551a7053d6fdfafda3f938979645c44812fbfcda098faae3f1a362d" -dependencies = [ - "bitflags 2.10.0", -] - [[package]] name = "regex" version = "1.12.2" @@ -1701,46 +1302,6 @@ version = "0.8.8" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "7a2d987857b319362043e95f5353c0535c1f58eec5336fdfcf626430af7def58" -[[package]] -name = "reqwest" -version = "0.11.27" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dd67538700a17451e7cba03ac727fb961abb7607553461627b97de0b89cf4a62" -dependencies = [ - "base64 0.21.7", - "bytes", - "encoding_rs", - "futures-core", - "futures-util", - "h2", - "http 0.2.12", - "http-body", - "hyper", - "hyper-tls", - "ipnet", - "js-sys", - "log", - "mime", - "native-tls", - "once_cell", - "percent-encoding", - "pin-project-lite", - "rustls-pemfile", - "serde", - "serde_json", - "serde_urlencoded", - "sync_wrapper", - "system-configuration", - "tokio", - "tokio-native-tls", - "tower-service", - "url", - "wasm-bindgen", - "wasm-bindgen-futures", - "web-sys", - "winreg", -] - [[package]] name = "rss" version = "2.0.12" @@ -1759,27 +1320,18 @@ version = "1.1.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"cd15f8a2c5551a84d56efdc1cd049089e409ac19a3072d5037a17fd70719ff3e" dependencies = [ - "bitflags 2.10.0", + "bitflags", "errno", "libc", "linux-raw-sys", "windows-sys 0.61.2", ] -[[package]] -name = "rustls-pemfile" -version = "1.0.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1c74cae0a4cf6ccbbf5f359f08efdf8ee7e1dc532573bf0db71968cb56b1448c" -dependencies = [ - "base64 0.21.7", -] - [[package]] name = "rustls-pki-types" -version = "1.13.1" +version = "1.14.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "708c0f9d5f54ba0272468c1d306a52c495b31fa155e91bc25371e6df7996908c" +checksum = "30a7197ae7eb376e574fe940d068c30fe0462554a3ddbe4eca7838e049c937a9" dependencies = [ "zeroize", ] @@ -1798,9 +1350,9 @@ checksum = "28d3b2b1366ec20994f1fd18c3c594f05c5dd4bc44d8bb0c1c632c8d6829481f" [[package]] name = "schannel" -version = "0.1.28" +version = "0.1.29" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "891d81b926048e76efe18581bf793546b4c0eaf8448d72be8de2bbee5fd166e1" +checksum = "91c1b7e4904c873ef0710c1f407dde2e6287de2bebc1bbbf7d430bb7cbffd939" dependencies = [ "windows-sys 0.61.2", ] @@ -1811,7 +1363,7 @@ version = "2.11.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "897b2245f0b511c87893af39b033e5ca9cce68824c4d7e7630b5a1d339658d02" dependencies = [ - "bitflags 2.10.0", + "bitflags", "core-foundation", "core-foundation-sys", "libc", @@ -1820,9 +1372,9 @@ dependencies = [ [[package]] name = "security-framework-sys" -version = "2.15.0" +version = "2.17.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "cc1f0cbffaac4852523ce30d8bd3c5cdc873501d96ff467ca09b6767bb8cd5c0" +checksum = "6ce2691df843ecc5d231c0b14ece2acc3efb62c0a398c7e1d875f3983ce020e3" dependencies = [ "core-foundation-sys", "libc", @@ -1871,27 +1423,6 @@ dependencies = [ "serde_core", ] -[[package]] -name = "serde_spanned" -version = "0.6.9" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "bf41e0cfaf7226dca15e8197172c295a782857fcb97fad1808a166870dee75a3" -dependencies = [ - "serde", -] - -[[package]] -name = "serde_urlencoded" -version = "0.7.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d3491c14715ca2294c4d6a88f15e84739788c1d030eed8c110436aafdaa2f3fd" -dependencies = [ - "form_urlencoded", - "itoa", - "ryu", - "serde", -] - [[package]] name = "serde_yaml" version = "0.9.34+deprecated" @@ -1952,55 +1483,6 @@ version = "1.15.1" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "67b1b7a3b5fe4f1376887184045fcf45c69e92af734b7aaddc05fb777b6fbd03" -[[package]] -name = "smallvec" -version = "2.0.0-alpha.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "51d44cfb396c3caf6fbfd0ab422af02631b69ddd96d2eff0b0f0724f9024051b" - -[[package]] -name = "socket2" -version = "0.5.10" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e22376abed350d73dd1cd119b57ffccad95b4e585a7cda43e286245ce23c0678" -dependencies = [ - "libc", - "windows-sys 0.52.0", -] - -[[package]] -name = "socket2" -version = "0.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "17129e116933cf371d018bb80ae557e889637989d8638274fb25622827b03881" -dependencies = [ - "libc", - "windows-sys 0.60.2", -] - -[[package]] -name = "socks" -version = "0.3.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f0c3dbbd9ae980613c6dd8e28a9407b50509d3803b57624d5dfe8315218cd58b" -dependencies = [ - "byteorder", - "libc", - "winapi", -] - -[[package]] -name = "spm_precompiled" -version = "0.1.4" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5851699c4033c63636f7ea4cf7b7c1f1bf06d0cc03cfb42e711de5a5c46cf326" -dependencies = [ - "base64 0.13.1", - "nom", - "serde", - "unicode-segmentation", -] - [[package]] name = "stable_deref_trait" version = 
"1.2.1" @@ -2024,12 +1506,6 @@ dependencies = [ "unicode-ident", ] -[[package]] -name = "sync_wrapper" -version = "0.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2047c6ded9c721764247e62cd3b03c09ffc529b2ba5b10ec482ae507a4a70160" - [[package]] name = "synstructure" version = "0.13.2" @@ -2041,38 +1517,6 @@ dependencies = [ "syn", ] -[[package]] -name = "system-configuration" -version = "0.5.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ba3a3adc5c275d719af8cb4272ea1c4a6d668a777f37e115f6d11ddbc1c8e0e7" -dependencies = [ - "bitflags 1.3.2", - "core-foundation", - "system-configuration-sys", -] - -[[package]] -name = "system-configuration-sys" -version = "0.5.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a75fb188eb626b924683e3b95e3a48e63551fcfb51949de2f06a9d91dbee93c9" -dependencies = [ - "core-foundation-sys", - "libc", -] - -[[package]] -name = "tar" -version = "0.4.44" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1d863878d212c87a19c1a610eb53bb01fe12951c0501cf5a0d65f724914a667a" -dependencies = [ - "filetime", - "libc", - "xattr", -] - [[package]] name = "tempfile" version = "3.23.0" @@ -2080,7 +1524,7 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "2d31c77bdf42a745371d260a26ca7163f1e0924b64afa0b688e61b5a9fa02f16" dependencies = [ "fastrand", - "getrandom 0.3.4", + "getrandom", "once_cell", "rustix", "windows-sys 0.61.2", @@ -2107,45 +1551,44 @@ dependencies = [ ] [[package]] -name = "tinystr" -version = "0.8.2" +name = "time" +version = "0.3.47" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "42d3e9c45c09de15d06dd8acf5f4e0e399e85927b7f00711024eb7ae10fa4869" +checksum = "743bd48c283afc0388f9b8827b976905fb217ad9e647fae3a379a9283c4def2c" dependencies = [ - "displaydoc", - "zerovec", + "deranged", + "itoa", + "num-conv", + "powerfmt", + "serde_core", + "time-core", + 
"time-macros", ] [[package]] -name = "tokenizers" -version = "0.19.1" +name = "time-core" +version = "0.1.8" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e500fad1dd3af3d626327e6a3fe5050e664a6eaa4708b8ca92f1794aaf73e6fd" +checksum = "7694e1cfe791f8d31026952abf09c69ca6f6fa4e1a1229e18988f06a04a12dca" + +[[package]] +name = "time-macros" +version = "0.2.27" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "2e70e4c5a0e0a8a4823ad65dfe1a6930e4f4d756dcd9dd7939022b5e8c501215" dependencies = [ - "aho-corasick", - "derive_builder", - "esaxx-rs", - "getrandom 0.2.16", - "indicatif", - "itertools 0.12.1", - "lazy_static", - "log", - "macro_rules_attribute", - "monostate", - "onig", - "paste", - "rand", - "rayon", - "rayon-cond", - "regex", - "regex-syntax", - "serde", - "serde_json", - "spm_precompiled", - "thiserror", - "unicode-normalization-alignments", - "unicode-segmentation", - "unicode_categories", + "num-conv", + "time-core", +] + +[[package]] +name = "tinystr" +version = "0.8.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "42d3e9c45c09de15d06dd8acf5f4e0e399e85927b7f00711024eb7ae10fa4869" +dependencies = [ + "displaydoc", + "zerovec", ] [[package]] @@ -2154,13 +1597,8 @@ version = "1.48.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "ff360e02eab121e0bc37a2d3b4d4dc622e6eda3a8e5253d5435ecf5bd4c68408" dependencies = [ - "bytes", - "libc", - "mio", "pin-project-lite", - "socket2 0.6.1", "tokio-macros", - "windows-sys 0.61.2", ] [[package]] @@ -2174,16 +1612,6 @@ dependencies = [ "syn", ] -[[package]] -name = "tokio-native-tls" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bbae76ab933c85776efabc971569dd6119c580d8f5d448769dec1764bf796ef2" -dependencies = [ - "native-tls", - "tokio", -] - [[package]] name = "tokio-stream" version = "0.1.17" @@ -2208,91 +1636,6 @@ dependencies = [ "tokio-stream", ] 
-[[package]] -name = "tokio-util" -version = "0.7.17" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2efa149fe76073d6e8fd97ef4f4eca7b67f599660115591483572e406e165594" -dependencies = [ - "bytes", - "futures-core", - "futures-sink", - "pin-project-lite", - "tokio", -] - -[[package]] -name = "toml" -version = "0.8.23" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dc1beb996b9d83529a9e75c17a1686767d148d70663143c7854d8b4a09ced362" -dependencies = [ - "serde", - "serde_spanned", - "toml_datetime", - "toml_edit", -] - -[[package]] -name = "toml_datetime" -version = "0.6.11" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "22cddaf88f4fbc13c51aebbf5f8eceb5c7c5a9da2ac40a13519eb5b0a0e8f11c" -dependencies = [ - "serde", -] - -[[package]] -name = "toml_edit" -version = "0.22.27" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "41fe8c660ae4257887cf66394862d21dbca4a6ddd26f04a3560410406a2f819a" -dependencies = [ - "indexmap", - "serde", - "serde_spanned", - "toml_datetime", - "toml_write", - "winnow", -] - -[[package]] -name = "toml_write" -version = "0.1.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5d99f8c9a7727884afe522e9bd5edbfc91a3312b36a77b5fb8926e4c31a41801" - -[[package]] -name = "tower-service" -version = "0.3.3" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8df9b6e13f2d32c91b9bd719c00d1958837bc7dec474d94952798cc8e69eeec3" - -[[package]] -name = "tracing" -version = "0.1.43" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2d15d90a0b5c19378952d479dc858407149d7bb45a14de0142f6c534b16fc647" -dependencies = [ - "pin-project-lite", - "tracing-core", -] - -[[package]] -name = "tracing-core" -version = "0.1.35" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "7a04e24fab5c89c6a36eb8558c9656f30d81de51dfa4d3b45f26b21d61fa0a6c" 
-dependencies = [ - "once_cell", -] - -[[package]] -name = "try-lock" -version = "0.2.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "e421abadd41a4225275504ea4d6566923418b7f05506fbc9c0fe86ba7396114b" - [[package]] name = "typenum" version = "1.19.0" @@ -2305,33 +1648,12 @@ version = "1.0.22" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9312f7c4f6ff9069b165498234ce8be658059c6728633667c526e27dc2cf1df5" -[[package]] -name = "unicode-normalization-alignments" -version = "0.1.12" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "43f613e4fa046e69818dd287fdc4bc78175ff20331479dab6e1b0f98d57062de" -dependencies = [ - "smallvec 1.15.1", -] - -[[package]] -name = "unicode-segmentation" -version = "1.12.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f6ccf251212114b54433ec949fd6a7841275f9ada20dddd2f29e9ceea4501493" - [[package]] name = "unicode-width" version = "0.2.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b4ac048d71ede7ee76d585517add45da530660ef4390e49b098733c6e897f254" -[[package]] -name = "unicode_categories" -version = "0.1.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "39ec24b3121d976906ece63c9daad25b85969647682eee313cb5779fdd69e14e" - [[package]] name = "unsafe-libyaml" version = "0.2.11" @@ -2340,30 +1662,33 @@ checksum = "673aac59facbab8a9007c7f6108d11f63b603f7cabff99fabf650fea5c32b861" [[package]] name = "ureq" -version = "3.1.4" +version = "3.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d39cb1dbab692d82a977c0392ffac19e188bd9186a9f32806f0aaa859d75585a" +checksum = "dea7109cdcd5864d4eeb1b58a1648dc9bf520360d7af16ec26d0a9354bafcfc0" dependencies = [ - "base64 0.22.1", + "base64", + "cookie_store", "der", + "flate2", "log", "native-tls", "percent-encoding", "rustls-pki-types", - "socks", + "serde", + "serde_json", "ureq-proto", - 
"utf-8", + "utf8-zero", "webpki-root-certs", ] [[package]] name = "ureq-proto" -version = "0.5.3" +version = "0.6.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d81f9efa9df032be5934a46a068815a10a042b494b6a58cb0a1a97bb5467ed6f" +checksum = "e994ba84b0bd1b1b0cf92878b7ef898a5c1760108fe7b6010327e274917a808c" dependencies = [ - "base64 0.22.1", - "http 1.4.0", + "base64", + "http", "httparse", "log", ] @@ -2381,10 +1706,10 @@ dependencies = [ ] [[package]] -name = "utf-8" -version = "0.7.6" +name = "utf8-zero" +version = "0.8.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "09cc8ee72d2a9becf2f2febe0205bbed8fc6615b7cb429ad062dc7b7ddd036a9" +checksum = "b8c0a043c9540bae7c578c88f91dda8bd82e59ae27c21baca69c8b191aaf5a6e" [[package]] name = "utf8_iter" @@ -2410,21 +1735,6 @@ version = "0.9.5" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0b928f33d975fc6ad9f86c8f283853ad26bdd5b10b7f1542aa2fa15e2289105a" -[[package]] -name = "want" -version = "0.3.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "bfa7760aed19e106de2c7c0b581b509f2f25d3dacaf737cb82ac61bc6d760b0e" -dependencies = [ - "try-lock", -] - -[[package]] -name = "wasi" -version = "0.11.1+wasi-snapshot-preview1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ccf3ec651a847eb01de73ccad15eb7d99f80485de043efb2f370cd654f4ea44b" - [[package]] name = "wasip2" version = "1.0.1+wasi-0.2.4" @@ -2447,19 +1757,6 @@ dependencies = [ "wasm-bindgen-shared", ] -[[package]] -name = "wasm-bindgen-futures" -version = "0.4.56" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "836d9622d604feee9e5de25ac10e3ea5f2d65b41eac0d9ce72eb5deae707ce7c" -dependencies = [ - "cfg-if", - "js-sys", - "once_cell", - "wasm-bindgen", - "web-sys", -] - [[package]] name = "wasm-bindgen-macro" version = "0.2.106" @@ -2492,57 +1789,15 @@ dependencies = [ "unicode-ident", ] 
-[[package]] -name = "web-sys" -version = "0.3.83" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9b32828d774c412041098d182a8b38b16ea816958e07cf40eec2bc080ae137ac" -dependencies = [ - "js-sys", - "wasm-bindgen", -] - -[[package]] -name = "web-time" -version = "1.1.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5a6580f308b1fad9207618087a65c04e7a10bc77e02c8e84e9b00dd4b12fa0bb" -dependencies = [ - "js-sys", - "wasm-bindgen", -] - [[package]] name = "webpki-root-certs" -version = "1.0.4" +version = "1.0.7" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ee3e3b5f5e80bc89f30ce8d0343bf4e5f12341c51f3e26cbeecbc7c85443e85b" +checksum = "f31141ce3fc3e300ae89b78c0dd67f9708061d1d2eda54b8209346fd6be9a92c" dependencies = [ "rustls-pki-types", ] -[[package]] -name = "winapi" -version = "0.3.9" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5c839a674fcd7a98952e593242ea400abe93992746761e38641405d28b00f419" -dependencies = [ - "winapi-i686-pc-windows-gnu", - "winapi-x86_64-pc-windows-gnu", -] - -[[package]] -name = "winapi-i686-pc-windows-gnu" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ac3b87c63620426dd9b991e5ce0329eff545bccbbb34f3be09ff6fb6ab51b7b6" - -[[package]] -name = "winapi-x86_64-pc-windows-gnu" -version = "0.4.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "712e227841d057c1ee1cd2fb22fa7e5a5461ae8e48fa2ca79ec42cfc1931183f" - [[package]] name = "windows-core" version = "0.62.2" @@ -2602,40 +1857,13 @@ dependencies = [ "windows-link", ] -[[package]] -name = "windows-sys" -version = "0.48.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "677d2418bec65e3338edb076e806bc1ec15693c5d0104683f2efe857f61056a9" -dependencies = [ - "windows-targets 0.48.5", -] - -[[package]] -name = "windows-sys" -version = "0.52.0" -source = 
"registry+https://github.com/rust-lang/crates.io-index" -checksum = "282be5f36a8ce781fad8c8ae18fa3f9beff57ec1b52cb3de0789201425d9a33d" -dependencies = [ - "windows-targets 0.52.6", -] - [[package]] name = "windows-sys" version = "0.59.0" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "1e38bc4d79ed67fd075bcc251a1c39b32a1776bbe92e5bef1f0bf1f8c531853b" dependencies = [ - "windows-targets 0.52.6", -] - -[[package]] -name = "windows-sys" -version = "0.60.2" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "f2f500e4d28234f72040990ec9d39e3a6b950f9f22d3dba18416c35882612bcb" -dependencies = [ - "windows-targets 0.53.5", + "windows-targets", ] [[package]] @@ -2647,211 +1875,70 @@ dependencies = [ "windows-link", ] -[[package]] -name = "windows-targets" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9a2fa6e2155d7247be68c096456083145c183cbbbc2764150dda45a87197940c" -dependencies = [ - "windows_aarch64_gnullvm 0.48.5", - "windows_aarch64_msvc 0.48.5", - "windows_i686_gnu 0.48.5", - "windows_i686_msvc 0.48.5", - "windows_x86_64_gnu 0.48.5", - "windows_x86_64_gnullvm 0.48.5", - "windows_x86_64_msvc 0.48.5", -] - [[package]] name = "windows-targets" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9b724f72796e036ab90c1021d4780d4d3d648aca59e491e6b98e725b84e99973" dependencies = [ - "windows_aarch64_gnullvm 0.52.6", - "windows_aarch64_msvc 0.52.6", - "windows_i686_gnu 0.52.6", - "windows_i686_gnullvm 0.52.6", - "windows_i686_msvc 0.52.6", - "windows_x86_64_gnu 0.52.6", - "windows_x86_64_gnullvm 0.52.6", - "windows_x86_64_msvc 0.52.6", -] - -[[package]] -name = "windows-targets" -version = "0.53.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4945f9f551b88e0d65f3db0bc25c33b8acea4d9e41163edf90dcd0b19f9069f3" -dependencies = [ - "windows-link", - "windows_aarch64_gnullvm 0.53.1", - 
"windows_aarch64_msvc 0.53.1", - "windows_i686_gnu 0.53.1", - "windows_i686_gnullvm 0.53.1", - "windows_i686_msvc 0.53.1", - "windows_x86_64_gnu 0.53.1", - "windows_x86_64_gnullvm 0.53.1", - "windows_x86_64_msvc 0.53.1", + "windows_aarch64_gnullvm", + "windows_aarch64_msvc", + "windows_i686_gnu", + "windows_i686_gnullvm", + "windows_i686_msvc", + "windows_x86_64_gnu", + "windows_x86_64_gnullvm", + "windows_x86_64_msvc", ] -[[package]] -name = "windows_aarch64_gnullvm" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "2b38e32f0abccf9987a4e3079dfb67dcd799fb61361e53e2882c3cbaf0d905d8" - [[package]] name = "windows_aarch64_gnullvm" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "32a4622180e7a0ec044bb555404c800bc9fd9ec262ec147edd5989ccd0c02cd3" -[[package]] -name = "windows_aarch64_gnullvm" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a9d8416fa8b42f5c947f8482c43e7d89e73a173cead56d044f6a56104a6d1b53" - -[[package]] -name = "windows_aarch64_msvc" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "dc35310971f3b2dbbf3f0690a219f40e2d9afcf64f9ab7cc1be722937c26b4bc" - [[package]] name = "windows_aarch64_msvc" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "09ec2a7bb152e2252b53fa7803150007879548bc709c039df7627cabbd05d469" -[[package]] -name = "windows_aarch64_msvc" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "b9d782e804c2f632e395708e99a94275910eb9100b2114651e04744e9b125006" - -[[package]] -name = "windows_i686_gnu" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "a75915e7def60c94dcef72200b9a8e58e5091744960da64ec734a6c6e9b3743e" - [[package]] name = "windows_i686_gnu" version = "0.52.6" source = 
"registry+https://github.com/rust-lang/crates.io-index" checksum = "8e9b5ad5ab802e97eb8e295ac6720e509ee4c243f69d781394014ebfe8bbfa0b" -[[package]] -name = "windows_i686_gnu" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "960e6da069d81e09becb0ca57a65220ddff016ff2d6af6a223cf372a506593a3" - [[package]] name = "windows_i686_gnullvm" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "0eee52d38c090b3caa76c563b86c3a4bd71ef1a819287c19d586d7334ae8ed66" -[[package]] -name = "windows_i686_gnullvm" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fa7359d10048f68ab8b09fa71c3daccfb0e9b559aed648a8f95469c27057180c" - -[[package]] -name = "windows_i686_msvc" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8f55c233f70c4b27f66c523580f78f1004e8b5a8b659e05a4eb49d4166cca406" - [[package]] name = "windows_i686_msvc" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "240948bc05c5e7c6dabba28bf89d89ffce3e303022809e73deaefe4f6ec56c66" -[[package]] -name = "windows_i686_msvc" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1e7ac75179f18232fe9c285163565a57ef8d3c89254a30685b57d83a38d326c2" - -[[package]] -name = "windows_x86_64_gnu" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "53d40abd2583d23e4718fddf1ebec84dbff8381c07cae67ff7768bbf19c6718e" - [[package]] name = "windows_x86_64_gnu" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "147a5c80aabfbf0c7d901cb5895d1de30ef2907eb21fbbab29ca94c5b08b1a78" -[[package]] -name = "windows_x86_64_gnu" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "9c3842cdd74a865a8066ab39c8a7a473c0778a3f29370b5fd6b4b9aa7df4a499" - 
-[[package]] -name = "windows_x86_64_gnullvm" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0b7b52767868a23d5bab768e390dc5f5c55825b6d30b86c844ff2dc7414044cc" - [[package]] name = "windows_x86_64_gnullvm" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "24d5b23dc417412679681396f2b49f3de8c1473deb516bd34410872eff51ed0d" -[[package]] -name = "windows_x86_64_gnullvm" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "0ffa179e2d07eee8ad8f57493436566c7cc30ac536a3379fdf008f47f6bb7ae1" - -[[package]] -name = "windows_x86_64_msvc" -version = "0.48.5" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "ed94fce61571a4006852b7389a063ab983c02eb1bb37b47f8272ce92d06d9538" - [[package]] name = "windows_x86_64_msvc" version = "0.52.6" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "589f6da84c646204747d1270a2a5661ea66ed1cced2631d546fdfb155959f9ec" -[[package]] -name = "windows_x86_64_msvc" -version = "0.53.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d6bbff5f0aada427a1e5a6da5f1f98158182f26556f345ac9e04d36d0ebed650" - -[[package]] -name = "winnow" -version = "0.7.14" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "5a5364e9d77fcdeeaa6062ced926ee3381faa2ee02d3eb83a5c27a8825540829" -dependencies = [ - "memchr", -] - -[[package]] -name = "winreg" -version = "0.50.0" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "524e57b2c537c0f9b1e69f1965311ec12182b4122e45035b1508cd24d2adadb1" -dependencies = [ - "cfg-if", - "windows-sys 0.48.0", -] - [[package]] name = "wit-bindgen" version = "0.46.0" @@ -2864,16 +1951,6 @@ version = "0.6.2" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "9edde0db4769d2dc68579893f2306b26c6ecfbe0ef499b013d731b7b9247e0b9" -[[package]] -name = 
"xattr" -version = "1.6.1" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "32e45ad4206f6d2479085147f02bc2ef834ac85886624a23575ae137c8aa8156" -dependencies = [ - "libc", - "rustix", -] - [[package]] name = "yoke" version = "0.8.1" @@ -2897,26 +1974,6 @@ dependencies = [ "synstructure", ] -[[package]] -name = "zerocopy" -version = "0.8.31" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "fd74ec98b9250adb3ca554bdde269adf631549f51d8a8f8f0a10b50f1cb298c3" -dependencies = [ - "zerocopy-derive", -] - -[[package]] -name = "zerocopy-derive" -version = "0.8.31" -source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "d8a8d209fdf45cf5138cbb5a506f6b52522a25afccc534d1475dad8e31105c6a" -dependencies = [ - "proc-macro2", - "quote", - "syn", -] - [[package]] name = "zerofrom" version = "0.1.6" diff --git a/actions/govbot/Cargo.toml b/actions/govbot/Cargo.toml index 2c2720aa..2d3f398a 100644 --- a/actions/govbot/Cargo.toml +++ b/actions/govbot/Cargo.toml @@ -2,7 +2,7 @@ name = "govbot" version = "0.1.0" edition = "2021" -description = "Streaming pipeline log government events from distributed government data" +description = "4-tool civic-data stack: pull real legislative data, filter by what you care about (fastclass tagging), publish with receipts (RSS/HTML/JSON/DuckDB/Bluesky), all from a coding-agent-native dev experience designed to run at nearly-free cost on commodity infrastructure." 
authors = ["sartaj"] [lib] @@ -40,29 +40,23 @@ pathdiff = "0.2" # Git operations git2 = { version = "0.18" } -# Text similarity and embeddings (lightweight, no external models) -# Using ONNX Runtime + tokenizers for semantic embeddings -ort = { version = "2.0.0-rc.10", default-features = true, features = ["ndarray"] } -tokenizers = "0.19" -ndarray = "0.15" -toml = "0.8" -# HTTP client for downloading models -reqwest = { version = "0.11", features = ["blocking"] } -# Hashing for text deduplication +# Hashing for text deduplication (.tag.json text hashes) sha2 = "0.10" # Timestamps chrono = { version = "0.4", features = ["serde"] } # RSS feed generation rss = "2.0" +# HTTP client for the Bluesky publisher's AT Protocol XRPC calls. +# `ureq` is a small, synchronous, blocking HTTP client — it suits a CLI that +# posts from cron/CI (no async runtime needed on the publish path) and keeps +# the dependency tree light. `native-tls` uses the platform TLS stack +# (Secure Transport / SChannel / OpenSSL) — no extra vendored crypto crate. +ureq = { version = "3", default-features = false, features = ["json", "native-tls", "gzip"] } [[bin]] name = "govbot" path = "src/main.rs" -[[bin]] -name = "generate-locale-enum" -path = "src/bin/generate-locale-enum.rs" - [dev-dependencies] tokio-test = "0.4" insta = { version = "1.39", features = ["json"] } diff --git a/actions/govbot/DUCKDB.md b/actions/govbot/DUCKDB.md index 0c63c4c8..42fe67f8 100644 --- a/actions/govbot/DUCKDB.md +++ b/actions/govbot/DUCKDB.md @@ -22,8 +22,8 @@ duckdb --version ## Quick Start ```bash -# 1. Clone repositories first -govbot clone all +# 1. Pull datasets first +govbot pull all # 2. Load into DuckDB govbot load diff --git a/actions/govbot/README.md b/actions/govbot/README.md index 17a7ec85..ec6f5d2d 100644 --- a/actions/govbot/README.md +++ b/actions/govbot/README.md @@ -18,11 +18,12 @@ govbot That's it. If no `govbot.yml` exists, an interactive wizard walks you through setup: -1. 
**Sources** - Choose all 47 states or pick specific ones -2. **Tags** - Start with an example tag, or get an AI prompt to create your own -3. **Publishing** - RSS feeds configured automatically +1. **Datasets** - Choose all 47 states or pick specific ones +2. **Classification** - Point the manifest at a fastclass classifier bundle +3. **Publishing** - An RSS feed publisher configured automatically -The wizard creates `govbot.yml`, `.gitignore`, and a GitHub Actions workflow. +The wizard creates `govbot.yml` (a project manifest: `datasets` / `transforms` / +`publish` / `pipelines`), `.gitignore`, and a GitHub Actions workflow. ### 3. Run the pipeline @@ -32,18 +33,19 @@ govbot With `govbot.yml` present, running `govbot` executes the full pipeline: -1. Clones/updates legislation repositories (smart: only clones on first run, pulls after) -2. Tags bills based on your tag definitions -3. Generates RSS feeds in the `docs/` directory +1. Pulls/updates legislation datasets (smart: only clones on first run, pulls after) +2. Classifies bills with fastclass and applies the results into the dataset +3. 
Runs the manifest's publishers (RSS feeds into the `docs/` directory) ### Other Commands ```bash -govbot clone all # download all state legislation datasets -govbot clone il ca ny # download specific states -govbot logs # stream legislative activity as JSON Lines -govbot logs | govbot tag # process and tag data -govbot build # generate RSS feeds +govbot pull all # download all state legislation datasets +govbot pull il ca ny # download specific states +govbot source # stream legislative activity as JSON Lines +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +govbot publish # run the manifest's publishers (RSS / HTML / JSON / DuckDB) +govbot run # run the full pipeline govbot load # load bill metadata into DuckDB govbot delete all # remove all downloaded data govbot update # update govbot to latest version @@ -74,22 +76,30 @@ We build snapshots off `examples`. Add examples to make a test. ## Advanced +Datasets are resolved at runtime through the **dataset registry** (see +[`REGISTRY.md`](./REGISTRY.md)). To point govbot at a custom registry: + ```bash -GOVBOT_REPO_URL_TEMPLATE="https://gitsite.com/org/{locale}.git" govbot ... +# An http(s):// URL or a local file path. +GOVBOT_REGISTRY_URL="https://example.com/registry.json" govbot pull all ``` -## Working with Logs +A project-local `.govbot/registry.json` is also honored. `govbot search` +queries the registry; `govbot pull` clones datasets once into the shared +`~/.govbot/cache/` and pins resolved commits in `govbot.lock`. + +## Working with the Record Stream -The `govbot logs` command outputs JSON Lines (JSONL) format, making it easy to pipe to tools like `jq`, `yq`, and `jl` for filtering, transformation, and pretty-printing, and even sending to AI CLI tools like `claude`. 
+The `govbot source` command outputs JSON Lines (JSONL) format, making it easy to pipe to tools like `jq`, `yq`, and `jl` for filtering, transformation, and pretty-printing, and even sending to AI CLI tools like `claude`. ### Basic Usage ```bash # Easiest way with smart defaults -govbot logs +govbot source # Get more args and their help -govbot logs --help +govbot source --help ``` ### modular CLI Examples @@ -100,10 +110,10 @@ Convert JSON Lines to prettified YAML: ```bash # Output prettified yaml -just govbot logs | yq -p=json -o=yaml '.' +just govbot source | yq -p=json -o=yaml '.' # Multiple documents (separated by ---) -govbot logs --repos="il" --limit=10 --filter=default | yq -p json -P +govbot source --repos="il" --limit=10 --filter=default | yq -p json -P ``` #### Filtering with `jq` @@ -112,16 +122,16 @@ Filter and transform JSON Lines: ```bash # Filter by specific fields -govbot logs| jq 'select(.log.action.classification[] == "passage")' +govbot source| jq 'select(.log.action.classification[] == "passage")' # Extract specific fields -govbot logs | jq '{bill_id: .log.bill_id, date: .log.action.date, description: .log.action.description}' +govbot source | jq '{bill_id: .log.bill_id, date: .log.action.date, description: .log.action.description}' # Count by bill -govbot logs | jq -s 'group_by(.log.bill_id) | map({bill_id: .[0].log.bill_id, count: length})' +govbot source | jq -s 'group_by(.log.bill_id) | map({bill_id: .[0].log.bill_id, count: length})' # Filter by date range -govbot logs | jq 'select(.timestamp >= "20250301" and .timestamp <= "20250331")' +govbot source | jq 'select(.timestamp >= "20250301" and .timestamp <= "20250331")' ``` #### Using `jl` (JSON Lines processor) @@ -130,10 +140,10 @@ govbot logs | jq 'select(.timestamp >= "20250301" and .timestamp <= "20250331")' ```bash # Pretty print JSON Lines -govbot logs | jl +govbot source | jl # Filter with jl -govbot logs | jl 'select(.log.action.classification[] == "passage")' +govbot source | jl 
'select(.log.action.classification[] == "passage")' ``` ### Combining Tools @@ -142,17 +152,17 @@ Chain multiple tools for powerful data processing: ```bash # Filter with jq, then convert to YAML -govbot logs --repos="il" --limit=100 | \ +govbot source --repos="il" --limit=100 | \ jq 'select(.log.action.classification[] == "passage")' | \ yq -p json -P # Extract and format specific fields, then output as YAML -govbot logs --repos="il" --limit=10 | \ +govbot source --repos="il" --limit=10 | \ jq '{bill: .log.bill_id, action: .log.action.description, date: .log.action.date}' | \ yq -p json -P # Aggregate data with jq, then format as YAML array -govbot logs --repos="il" --limit=100 | \ +govbot source --repos="il" --limit=100 | \ jq -s 'group_by(.log.bill_id) | map({bill_id: .[0].log.bill_id, actions: length})' | \ yq -P ``` @@ -161,16 +171,16 @@ govbot logs --repos="il" --limit=100 | \ ```bash # Find all bills with multiple actions in a single day -govbot logs --repos="il" --limit=1000 | \ +govbot source --repos="il" --limit=1000 | \ jq -s 'group_by(.log.bill_id + .timestamp) | map(select(length > 1)) | flatten' # Extract action classifications and count them -govbot logs --repos="il" --limit=1000 | \ +govbot source --repos="il" --limit=1000 | \ jq -r '.log.action.classification[]?' | \ sort | uniq -c | sort -rn # Join with bill metadata and filter by title -govbot logs --repos="il" --limit=10 --join=bill | \ +govbot source --repos="il" --limit=10 --join=bill | \ jq 'select(.bill.title | contains("Education"))' | \ yq -p json -P ``` @@ -181,61 +191,69 @@ Generate RSS feeds using the `govbot publish` command, which reads from `govbot. **Note:** The Python scripts have been replaced by a Rust implementation. Use `govbot publish` instead. -## Publishing RSS Feeds +## Publishing -Generate RSS feeds for each tag defined in `govbot.yml` using the declarative publishing system. +Publishers consume the classified result stream and emit artifacts. 
RSS, HTML, +JSON, and DuckDB are built-in publishers, declared in the manifest's `publish:` +map. ### Quick Start -1. **Configure `govbot.yml`** with your tags and publish settings: +1. **Configure `govbot.yml`** with your datasets, transforms, and publishers. + The tag taxonomy is NOT in `govbot.yml` — it lives in a separate fastclass + classifier bundle that `transforms.classify.classifier` references by path: ```yaml - repos: + datasets: - all - tags: - lgbtq: - description: "Legislation related to LGBTQ+ issues..." + transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + classifier: ./classifier publish: - base_url: "https://yourusername.github.io/repo-name" - output_dir: "feeds" + lgbtq-feed: + type: rss + select: [lgbtq] # tag names from the classifier bundle + base_url: "https://yourusername.github.io/repo-name" + output_dir: "feeds" + pipelines: + default: + - classify + - lgbtq-feed ``` -2. **Generate RSS feed:** +2. **Run all publishers:** ```bash govbot publish ``` -3. **Generate feed for specific tags:** +3. **Run a specific publisher:** ```bash - govbot publish --tags lgbtq education + govbot publish --publisher lgbtq-feed ``` 4. **Customize output:** + ```bash govbot publish --output-dir ./feeds --limit 100 ``` -### Configuration - -The `publish:` section in `govbot.yml` supports: +### Publisher configuration -- `base_url`: Base URL for RSS feed links (required for GitHub Pages) -- `output_dir`: Directory where RSS feeds are generated (default: `feeds`) -- `limit`: Maximum entries per feed (optional) +Each entry in `publish:` declares a `type` (`rss` / `html` / `json` / `duckdb`) +plus type-specific keys: -### Per-Tag Customization - -Tags can override default RSS feed settings: - -```yaml -tags: - lgbtq: - description: "..." 
- rss_title: "LGBTQ+ Legislation Updates" # Optional - rss_description: "Custom description" # Optional -``` +- `select`: tag names to include — only records carrying one of these tags are + published. Tag names must exist in the classifier bundle. +- `base_url`: base URL for generated links (required for `rss`/`html`). +- `output_dir`: directory the publisher writes into (default: `docs`). +- `output_file`: the primary artifact filename. +- `title` / `description`: custom feed/index metadata. +- `limit`: maximum entries (`"none"` for unlimited). ## Using DuckDB diff --git a/actions/govbot/REGISTRY.md b/actions/govbot/REGISTRY.md new file mode 100644 index 00000000..2a35f7fa --- /dev/null +++ b/actions/govbot/REGISTRY.md @@ -0,0 +1,94 @@ +# The govbot dataset registry + +govbot resolves datasets at **runtime** through a registry — an index that +maps a dataset identifier to the git repo holding its data. This is the +"npm/docker for government data" layer: it replaces the old compiled +52-variant `WorkingLocale` enum, so adding counties, cities, or agencies is a +data change, not a recompile. + +## Identifier scheme + +A canonical identifier is `namespace/name[@channel]`: + +| Part | Meaning | +|---|---| +| `namespace` | a grouping — `us-legislation`, a county set, an agency set | +| `name` | the dataset within the namespace — `wy`, `il`, … | +| `@channel` | optional release channel / git branch (defaults to the repo's default branch) | + +**Plain jurisdiction codes stay valid.** A bare identifier with no `/` (e.g. +`wy`) is resolved against the registry's `default_namespace`, so an existing +`govbot.yml` with `datasets: [wy]` keeps working unchanged. `all` is a +reserved alias meaning "every dataset in the registry." 
+ +Examples — all valid in `govbot.yml` / `govbot add` / `govbot pull`: + +``` +wy # bare code -> us-legislation/wy +us-legislation/wy # canonical +us-legislation/wy@main # pinned to a channel/branch +all # every dataset +``` + +## File format + +The registry is a JSON file. The bundled default lives at +`actions/govbot/data/registry.json` and is **compiled into the binary** via +`include_str!`, so a fresh install resolves the seed jurisdictions with zero +network access. + +```json +{ + "$schema_version": "govbot-registry-1", + "description": "…", + "default_namespace": "us-legislation", + "datasets": { + "us-legislation/wy": { + "git_url": "https://github.com/chn-openstates-files/wy-legislation.git", + "schema": "ocdfiles", + "path_pattern": "**/logs/*.json", + "name": "Wyoming" + } + } +} +``` + +Per-dataset fields: + +| Field | Required | Meaning | +|---|---|---| +| `git_url` | yes | the git repo the dataset's data is cloned from | +| `schema` | no | the data schema the dataset follows (e.g. `ocdfiles`) | +| `path_pattern` | no | a glob, relative to the repo root, locating the dataset's records | +| `name` | no | a human-readable display name | + +## Where the registry comes from / how it is fetched + +`Registry::load` resolves the active registry in priority order: + +1. **`GOVBOT_REGISTRY_URL`** — an `http(s)://` URL (fetched over HTTP) or a + local file path. A fetched registry is cached at `~/.govbot/registry.json`. +2. **`/.govbot/registry.json`** — a project-local registry file. +3. **The bundled default** compiled into the binary. + +This makes the registry both a shipped default and a fetchable/overridable +catalog — an open, PR-based registry repo or a hosted catalog can both be +pointed at via `GOVBOT_REGISTRY_URL`. + +## `govbot.lock` — the dataset lockfile + +`govbot.yml` declares *which* datasets a project wants; `govbot.lock` records +the *exact git commit* each resolved to. It is govbot's `Cargo.lock`. 
+ +- **Written/updated** by `govbot pull` and `govbot run`, next to `govbot.yml`. +- **Format** — JSON; see `src/lock.rs`. Each entry pins `git_url`, `channel`, + `commit`, `cache_key`, and `resolved_at`. +- **Commit it** to the project repo for reproducible runs. + +## The shared content-addressed cache + +A dataset is cloned **once per machine** into `~/.govbot/cache/`, where +`` is `-`. A project's +`.govbot/repos/` is a symlink into that cache. A second `pull` — in this +or any other project — finds the cache populated and only fetches deltas. See +`src/cache.rs`. diff --git a/actions/govbot/TAGGING.md b/actions/govbot/TAGGING.md index 604cec8f..3c9dadbe 100644 --- a/actions/govbot/TAGGING.md +++ b/actions/govbot/TAGGING.md @@ -1,93 +1,65 @@ -# Tagging Bills with Semantic Similarity +# Classifying and tagging bills -The `govbot tag` command can automatically tag legislative logs using semantic similarity matching. +govbot does **not** classify bills itself. Classification is delegated to +**fastclass**, a standalone, self-improving text classifier that runs as an +external transform. govbot streams documents in, fastclass classifies them, and +`govbot apply` persists the results. -### How tagging works +## The pipe -- **Primary mode (embeddings)**: Uses a sentence-transformer model (`model.onnx` + `tokenizer.json`) to embed logs and tags, combining: - - **Base similarity** between the log text and each tag’s description/examples - - **Example similarity** to individual positive examples - - **Keyword boosts** from `include_keywords` / `exclude_keywords` - - **Negative examples** penalties via `negative_examples` -- **Fallback mode (keywords only)**: If the embedding model or tokenizer cannot be loaded, govbot falls back to **keyword-based tagging** using `include_keywords` / `exclude_keywords` from the tag definitions. - -In both modes, each tag has a **`threshold`** and a structured **score breakdown** is stored in per-tag `.tag.json` files. 
- -## Quick Start - -1. **Place required files in your working directory:** - - - `govbot.yml` – Tag definitions (see below) - - `model.onnx` – ONNX sentence transformer model (e.g., all-MiniLM-L6-v2) - - `tokenizer.json` – Tokenizer file for the model - -2. **Run the command:** - - ```bash - just govbot logs --repos il --limit 10 | just govbot tag - ``` - -govbot will: - -- Require `govbot.yml` -- Try to use **embedding mode** (`model.onnx` + `tokenizer.json`) -- If embeddings are unavailable or fail to initialize, automatically **fall back to keyword-based matching** (using `include_keywords` / `exclude_keywords`). +```bash +govbot source --select docs | fastclass classify - classifier=./classifier | govbot apply +``` -## Tag Configuration (`govbot.yml`) +- **`govbot source --select docs`** emits one `{"id","text","kind":"docs"}` + document per bill carrying the **full bill text** from `metadata.json`. The + `id` is the bill's dataset path, which routes the result back to the right + place. +- **`fastclass classify -`** scores each document against a **classifier + bundle** — a fastclass-native directory (`classifier.yml` + `fusion.yml` + + `eval/`). govbot passes only the bundle path; it never reads the bundle. +- **`govbot apply`** reads fastclass's result JSON from stdin and writes per-tag + `.tag.json` files into the dataset. It classifies nothing — it is purely the + persistence sink. `govbot publish` later turns those files into feeds. -Each tag defines (YAML schema): +`govbot run` (or bare `govbot`) orchestrates this whole pipe automatically from +the manifest's `transforms:`/`pipelines:`. 
-- `name`: Tag identifier (key name in `tags:` map) -- `description`: Semantic description of what the tag represents -- `threshold`: Minimum similarity score (0.0–1.0) to match -- `examples`: Optional positive example phrases (improves embeddings) -- `include_keywords`: Phrases whose presence should strongly favor this tag -- `exclude_keywords`: Phrases that should block this tag -- `negative_examples`: Texts that should **not** match this tag (used as embedding negatives) +## The manifest declares the transform — not the taxonomy -Example: +`govbot.yml` is a project **manifest**. It has **no `tags:` block**. The +classify transform is declared under `transforms:` and points at a fastclass +classifier bundle by path: ```yaml -tags: - education: - description: > - Legislation related to schools, education funding, curriculum standards, - teacher certification, higher education policy, student loans, charter schools - threshold: 0.6 - examples: - - School funding bill - - Teacher certification requirements - include_keywords: - - education - - school funding - - curriculum - exclude_keywords: - - driver education - negative_examples: - - Resolution honoring local high school sports teams +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + classifier: ./classifier # path to the fastclass bundle (classifier.yml) ``` -## Getting the Model Files - -To use embedding mode, you need: +The tag taxonomy — descriptions, examples, keywords, thresholds, fusion +weights — lives entirely inside the fastclass classifier bundle's +`classifier.yml`, owned and versioned separately. See the fastclass docs and +its Claude Code plugin (`/fastclass:improve`, `/fastclass:ratify`) for building +and improving a bundle. -1. 
**ONNX Model**: Convert a sentence transformer model to ONNX +## Prerequisite - ```bash - # Using optimum-cli (requires Python) - pip install optimum[onnxruntime] - optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 minilm-l6-v2-onnx/ - ``` +The `fastclass` binary must be resolvable on `PATH`, `~/.cargo/bin`, or +`~/.govbot/bin`: -2. **Tokenizer**: The `tokenizer.json` file is included in the exported model directory. - -3. **Copy files**: Place `model.onnx` and `tokenizer.json` in your working directory (or in the directory pointed to by `--govbot-dir` / `GOVBOT_DIR`). +```bash +cd && cargo install --path . +``` -If either file is missing or cannot be loaded, govbot will **still run** using the keyword-based fallback described above. +govbot's transform runner resolves transform binaries the same way. ## Output -Tagged results are written to per-tag files under the session’s `tags/` directory: +`govbot apply` writes per-tag files under each session's `tags/` directory: ```text country:us/state:{state}/sessions/{session_id}/tags/{tag_name}.tag.json @@ -95,15 +67,9 @@ country:us/state:{state}/sessions/{session_id}/tags/{tag_name}.tag.json Each `{tag_name}.tag.json` file contains: -- `metadata`: Model info, last run timestamp, hash of the tag config -- `tag_config`: The tag definition as used on the last run -- `text_cache`: Deduplicated bill/log texts keyed by content hash -- `bills`: Map of bill identifiers to their `ScoreBreakdown` - -`ScoreBreakdown` includes: +- `metadata`: classifier info, last-run timestamp, tag-config hash +- `tag_config`: a stub tag definition (the real taxonomy lives in the bundle) +- `text_cache`: deduplicated bill texts keyed by content hash +- `bills`: a map of bill identifiers to their `ScoreBreakdown` -- `final_score`: Final score used for threshold comparison -- `base_embedding`: Base embedding similarity (if embeddings were used) -- `example_similarity`: Max similarity to positive examples -- `keyword_match`: 
Whether include_keywords matched -- `negative_penalty`: Penalty applied from negative examples (if any) +`ScoreBreakdown.final_score` is fastclass's calibrated probability for the tag. diff --git a/actions/govbot/action.yml b/actions/govbot/action.yml index cc0890f1..ee80424f 100644 --- a/actions/govbot/action.yml +++ b/actions/govbot/action.yml @@ -1,24 +1,21 @@ name: "Govbot" -description: "Clone repos, tag bills, and create RSS feeds from govbot.yml configuration" +description: "Pull datasets, classify bills with fastclass, and publish feeds from a govbot.yml manifest" branding: icon: 'rss' color: 'orange' inputs: - tags: - description: 'Comma-separated list of tags to include in feed (default: all tags from govbot.yml)' - required: false limit: - description: 'Limit number of entries per feed' + description: 'Limit number of entries per published artifact (use "none" for all)' required: false output-dir: - description: 'Output directory for RSS feed (default: from govbot.yml build.output_dir)' + description: 'Output directory override for publishers' required: false output-file: - description: 'Output filename for RSS feed (default: from govbot.yml build.output_file)' + description: 'Output filename override for publishers' required: false govbot-dir: - description: 'Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env var)' + description: 'Govbot directory (default: $CWD/.govbot, or GOVBOT_DIR env var)' required: false outputs: @@ -39,7 +36,7 @@ runs: curl -fsSL "https://github.com/chihacknight/govbot/releases/download/nightly/govbot-linux-x86_64" \ -o "${{ github.action_path }}/bin/govbot" chmod +x "${{ github.action_path }}/bin/govbot" - + - name: Set GOVBOT_DIR id: set-govbot-dir shell: bash @@ -52,8 +49,8 @@ runs: fi echo "GOVBOT_DIR=$GOVBOT_DIR" >> $GITHUB_ENV echo "repos-dir=$GOVBOT_DIR/repos" >> $GITHUB_OUTPUT - - - name: Restore repos cache + + - name: Restore datasets cache id: cache-repos uses: actions/cache@v4 with: @@ -61,110 +58,88 @@ 
runs: key: govbot-repos-${{ runner.os }}-${{ hashFiles('govbot.yml') }} restore-keys: | govbot-repos-${{ runner.os }}- - + - name: Debug cache status shell: bash run: | echo "Cache hit: ${{ steps.cache-repos.outputs.cache-hit }}" if [ "${{ steps.cache-repos.outputs.cache-hit }}" == "true" ]; then - echo "✅ Using cached repos - will only update existing repos" + echo "Using cached datasets - will only update existing datasets" else - echo "❌ Cache miss - will clone all repos" + echo "Cache miss - will pull all datasets" fi - - - name: Clone legislation repositories - id: clone + + - name: Pull legislation datasets + id: pull shell: bash working-directory: ${{ github.workspace }} run: | - # Check if govbot.yml exists + # govbot.yml must exist at the repository root. if [ ! -f "${{ github.workspace }}/govbot.yml" ]; then echo "::error::govbot.yml not found in repository root" exit 1 fi - - # If cache was hit, govbot clone will just update existing repos (git pull) - # If cache was missed, govbot clone will do fresh clones - # When no repos are specified, govbot clone updates existing repos only + + # Cache hit: `govbot pull` (no args) updates existing datasets only. + # Cache miss: `govbot pull all` does fresh clones. if [ "${{ steps.cache-repos.outputs.cache-hit }}" == "true" ]; then - echo "📥 Cache hit - updating existing repositories..." - # Update existing repos (no args = update existing only) - ${{ github.action_path }}/bin/govbot clone \ + echo "Cache hit - updating existing datasets..." + ${{ github.action_path }}/bin/govbot pull \ --govbot-dir "$GOVBOT_DIR" || true else - echo "📥 Cache miss - cloning all repositories..." - # Clone all repos - ${{ github.action_path }}/bin/govbot clone all \ + echo "Cache miss - pulling all datasets..." 
+ ${{ github.action_path }}/bin/govbot pull all \ --govbot-dir "$GOVBOT_DIR" || true fi - - - name: Save repos cache + + - name: Save datasets cache if: steps.cache-repos.outputs.cache-hit != 'true' uses: actions/cache@v4 with: path: ${{ steps.set-govbot-dir.outputs.repos-dir }} key: govbot-repos-${{ runner.os }}-${{ hashFiles('govbot.yml') }} - - - name: Tag bills + + - name: Classify and apply shell: bash working-directory: ${{ github.workspace }} run: | - echo "🏷️ Tagging bills..." - ${{ github.action_path }}/bin/govbot logs | ${{ github.action_path }}/bin/govbot tag || true - - - name: Generate RSS feed + # Classification is delegated to fastclass, an external transform that + # must be resolvable on PATH / ~/.cargo/bin / ~/.govbot/bin. The + # classifier bundle path is declared in govbot.yml under + # transforms..classifier. + echo "Classifying bills (source | fastclass classify | apply)..." + ${{ github.action_path }}/bin/govbot source --select docs \ + | fastclass classify - \ + | ${{ github.action_path }}/bin/govbot apply || true + + - name: Publish feeds id: publish shell: bash working-directory: ${{ github.workspace }} run: | - # Build command arguments ARGS="" - - # Add tags if specified - if [ -n "${{ inputs.tags }}" ]; then - # Convert comma-separated to space-separated - TAGS=$(echo "${{ inputs.tags }}" | tr ',' ' ') - ARGS="$ARGS --tags $TAGS" - fi - - # Add limit if specified if [ -n "${{ inputs.limit }}" ]; then ARGS="$ARGS --limit ${{ inputs.limit }}" fi - - # Add output directory if specified if [ -n "${{ inputs.output-dir }}" ]; then ARGS="$ARGS --output-dir ${{ inputs.output-dir }}" fi - - # Add output file if specified if [ -n "${{ inputs.output-file }}" ]; then ARGS="$ARGS --output-file ${{ inputs.output-file }}" fi - - # Add govbot-dir if specified if [ -n "${{ inputs.govbot-dir }}" ]; then ARGS="$ARGS --govbot-dir ${{ inputs.govbot-dir }}" fi - - # Run build command - # govbot.yml is automatically found in workspace root - ${{ 
github.action_path }}/bin/govbot build $ARGS - - # Determine output path (read from govbot.yml or use defaults) - if [ -n "${{ inputs.output-dir }}" ]; then - OUTPUT_DIR="${{ inputs.output-dir }}" - else - # Try to read from govbot.yml, default to docs - OUTPUT_DIR=$(grep -A 5 "^build:" "${{ github.workspace }}/govbot.yml" | grep "output_dir:" | awk '{print $2}' | tr -d '"' || echo "docs") - fi - - if [ -n "${{ inputs.output-file }}" ]; then - OUTPUT_FILE="${{ inputs.output-file }}" - else - # Try to read from govbot.yml, default to feed.xml - OUTPUT_FILE=$(grep -A 5 "^build:" "${{ github.workspace }}/govbot.yml" | grep "output_file:" | awk '{print $2}' | tr -d '"' || echo "feed.xml") - fi - + + # Run the manifest's publishers. govbot.yml is found in the workspace root. + ${{ github.action_path }}/bin/govbot publish $ARGS + + # Determine the primary output path from the override inputs, defaulting + # to docs/feed.xml (the wizard's default RSS publisher). + OUTPUT_DIR="${{ inputs.output-dir }}" + [ -z "$OUTPUT_DIR" ] && OUTPUT_DIR="docs" + OUTPUT_FILE="${{ inputs.output-file }}" + [ -z "$OUTPUT_FILE" ] && OUTPUT_FILE="feed.xml" + echo "feed-path=${{ github.workspace }}/$OUTPUT_DIR/$OUTPUT_FILE" >> $GITHUB_OUTPUT echo "feed-dir=${{ github.workspace }}/$OUTPUT_DIR" >> $GITHUB_OUTPUT diff --git a/actions/govbot/data/registry.json b/actions/govbot/data/registry.json new file mode 100644 index 00000000..e88f79e2 --- /dev/null +++ b/actions/govbot/data/registry.json @@ -0,0 +1,62 @@ +{ + "$schema_version": "govbot-registry-1", + "description": "The govbot dataset registry. Maps a dataset identifier to the git repo that holds its data, the data schema it follows, and the glob that locates its records within the repo. Datasets are git repos; this index is 'npm/docker for government data'. Bundled as a default in the govbot binary and overridable from a URL via GOVBOT_REGISTRY_URL. 
See actions/govbot/REGISTRY.md.", + "default_namespace": "us-legislation", + "datasets": { + "us-legislation/al": {"git_url": "https://github.com/chn-openstates-files/al-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Alabama"}, + "us-legislation/ak": {"git_url": "https://github.com/chn-openstates-files/ak-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Alaska"}, + "us-legislation/az": {"git_url": "https://github.com/chn-openstates-files/az-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Arizona"}, + "us-legislation/ar": {"git_url": "https://github.com/chn-openstates-files/ar-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Arkansas"}, + "us-legislation/ca": {"git_url": "https://github.com/chn-openstates-files/ca-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "California"}, + "us-legislation/co": {"git_url": "https://github.com/chn-openstates-files/co-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Colorado"}, + "us-legislation/ct": {"git_url": "https://github.com/chn-openstates-files/ct-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Connecticut"}, + "us-legislation/de": {"git_url": "https://github.com/chn-openstates-files/de-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Delaware"}, + "us-legislation/fl": {"git_url": "https://github.com/chn-openstates-files/fl-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Florida"}, + "us-legislation/ga": {"git_url": "https://github.com/chn-openstates-files/ga-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Georgia"}, + "us-legislation/hi": {"git_url": "https://github.com/chn-openstates-files/hi-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": 
"Hawaii"}, + "us-legislation/id": {"git_url": "https://github.com/chn-openstates-files/id-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Idaho"}, + "us-legislation/il": {"git_url": "https://github.com/chn-openstates-files/il-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Illinois"}, + "us-legislation/in": {"git_url": "https://github.com/chn-openstates-files/in-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Indiana"}, + "us-legislation/ia": {"git_url": "https://github.com/chn-openstates-files/ia-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Iowa"}, + "us-legislation/ks": {"git_url": "https://github.com/chn-openstates-files/ks-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Kansas"}, + "us-legislation/ky": {"git_url": "https://github.com/chn-openstates-files/ky-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Kentucky"}, + "us-legislation/la": {"git_url": "https://github.com/chn-openstates-files/la-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Louisiana"}, + "us-legislation/me": {"git_url": "https://github.com/chn-openstates-files/me-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Maine"}, + "us-legislation/md": {"git_url": "https://github.com/chn-openstates-files/md-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Maryland"}, + "us-legislation/ma": {"git_url": "https://github.com/chn-openstates-files/ma-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Massachusetts"}, + "us-legislation/mi": {"git_url": "https://github.com/chn-openstates-files/mi-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Michigan"}, + "us-legislation/mn": {"git_url": 
"https://github.com/chn-openstates-files/mn-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Minnesota"}, + "us-legislation/ms": {"git_url": "https://github.com/chn-openstates-files/ms-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Mississippi"}, + "us-legislation/mo": {"git_url": "https://github.com/chn-openstates-files/mo-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Missouri"}, + "us-legislation/mt": {"git_url": "https://github.com/chn-openstates-files/mt-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Montana"}, + "us-legislation/ne": {"git_url": "https://github.com/chn-openstates-files/ne-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Nebraska"}, + "us-legislation/nv": {"git_url": "https://github.com/chn-openstates-files/nv-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Nevada"}, + "us-legislation/nh": {"git_url": "https://github.com/chn-openstates-files/nh-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "New Hampshire"}, + "us-legislation/nj": {"git_url": "https://github.com/chn-openstates-files/nj-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "New Jersey"}, + "us-legislation/nm": {"git_url": "https://github.com/chn-openstates-files/nm-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "New Mexico"}, + "us-legislation/ny": {"git_url": "https://github.com/chn-openstates-files/ny-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "New York"}, + "us-legislation/nc": {"git_url": "https://github.com/chn-openstates-files/nc-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "North Carolina"}, + "us-legislation/nd": {"git_url": "https://github.com/chn-openstates-files/nd-legislation.git", 
"schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "North Dakota"}, + "us-legislation/oh": {"git_url": "https://github.com/chn-openstates-files/oh-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Ohio"}, + "us-legislation/ok": {"git_url": "https://github.com/chn-openstates-files/ok-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Oklahoma"}, + "us-legislation/or": {"git_url": "https://github.com/chn-openstates-files/or-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Oregon"}, + "us-legislation/pa": {"git_url": "https://github.com/chn-openstates-files/pa-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Pennsylvania"}, + "us-legislation/ri": {"git_url": "https://github.com/chn-openstates-files/ri-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Rhode Island"}, + "us-legislation/sc": {"git_url": "https://github.com/chn-openstates-files/sc-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "South Carolina"}, + "us-legislation/sd": {"git_url": "https://github.com/chn-openstates-files/sd-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "South Dakota"}, + "us-legislation/tn": {"git_url": "https://github.com/chn-openstates-files/tn-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Tennessee"}, + "us-legislation/tx": {"git_url": "https://github.com/chn-openstates-files/tx-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Texas"}, + "us-legislation/ut": {"git_url": "https://github.com/chn-openstates-files/ut-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Utah"}, + "us-legislation/vt": {"git_url": "https://github.com/chn-openstates-files/vt-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Vermont"}, 
+ "us-legislation/va": {"git_url": "https://github.com/chn-openstates-files/va-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Virginia"}, + "us-legislation/wa": {"git_url": "https://github.com/chn-openstates-files/wa-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Washington"}, + "us-legislation/wv": {"git_url": "https://github.com/chn-openstates-files/wv-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "West Virginia"}, + "us-legislation/wi": {"git_url": "https://github.com/chn-openstates-files/wi-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Wisconsin"}, + "us-legislation/wy": {"git_url": "https://github.com/chn-openstates-files/wy-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Wyoming"}, + "us-legislation/pr": {"git_url": "https://github.com/chn-openstates-files/pr-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Puerto Rico"}, + "us-legislation/mp": {"git_url": "https://github.com/chn-openstates-files/mp-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Northern Mariana Islands"}, + "us-legislation/vi": {"git_url": "https://github.com/chn-openstates-files/vi-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "U.S. 
Virgin Islands"}, + "us-legislation/gu": {"git_url": "https://github.com/chn-openstates-files/gu-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "Guam"}, + "us-legislation/usa": {"git_url": "https://github.com/chn-openstates-files/usa-legislation.git", "schema": "ocdfiles", "path_pattern": "**/logs/*.json", "name": "United States Congress"} + } +} diff --git a/actions/govbot/examples/govbot-clone-list.sh b/actions/govbot/examples/govbot-clone-list.sh deleted file mode 100644 index 843c33e2..00000000 --- a/actions/govbot/examples/govbot-clone-list.sh +++ /dev/null @@ -1 +0,0 @@ -govbot clone --list \ No newline at end of file diff --git a/actions/govbot/examples/govbot-pull-list.sh b/actions/govbot/examples/govbot-pull-list.sh new file mode 100644 index 00000000..182ab9d1 --- /dev/null +++ b/actions/govbot/examples/govbot-pull-list.sh @@ -0,0 +1 @@ +govbot pull --list \ No newline at end of file diff --git a/actions/govbot/examples/logs-basic.sh b/actions/govbot/examples/logs-basic.sh deleted file mode 100644 index 47118436..00000000 --- a/actions/govbot/examples/logs-basic.sh +++ /dev/null @@ -1 +0,0 @@ -govbot logs diff --git a/actions/govbot/examples/source-basic.sh b/actions/govbot/examples/source-basic.sh new file mode 100644 index 00000000..caa69b00 --- /dev/null +++ b/actions/govbot/examples/source-basic.sh @@ -0,0 +1 @@ +govbot source diff --git a/actions/govbot/justfile b/actions/govbot/justfile index 54ed620e..6fde2f32 100644 --- a/actions/govbot/justfile +++ b/actions/govbot/justfile @@ -56,9 +56,9 @@ build-release: # Usage: just govbot [COMMAND] [ARGS...] 
# Examples: # just govbot --help -# just govbot clone usa il -# just govbot clone --govbot-dir custom-dir usa -# just govbot logs --repos usa +# just govbot pull usa il +# just govbot pull --govbot-dir custom-dir usa +# just govbot source --repos usa govbot *ARGS: #!/usr/bin/env bash set -e @@ -140,38 +140,33 @@ run: run-args ARGS: cargo run -- {{ARGS}} -# Tag bills using AI (reads JSON lines from stdin) -# Usage: just govbot logs --repos il --limit 10 | just govbot tag --ai-tool "ollama run llama3" -# Example: just govbot logs --repos il --limit 10 | just govbot tag --ai-tool "ollama" -# Note: The tag command reads from stdin, so pipe the logs output to it -tag *ARGS: +# Apply fastclass classification results (reads result JSON from stdin) +# Usage: just govbot source --select docs | fastclass classify - | just apply +# Note: The apply command reads from stdin, so pipe the classify output to it +apply *ARGS: #!/usr/bin/env bash set -e - + DEV_DIR=".govbot" - + # Build release binary if it doesn't exist or if any source files are newer if [ ! -f "target/release/govbot" ] || [ "src" -nt "target/release/govbot" ] || find src -name "*.rs" -newer "target/release/govbot" 2>/dev/null | grep -q .; then echo "🔨 Building release target..." cargo build --release fi - + # Check if --govbot-dir is already in the arguments ARGS_STR="{{ARGS}}" if [[ "$ARGS_STR" =~ --govbot-dir ]]; then - ./target/release/govbot tag {{ARGS}} + ./target/release/govbot apply {{ARGS}} else - ./target/release/govbot tag {{ARGS}} --govbot-dir "$DEV_DIR" + ./target/release/govbot apply {{ARGS}} --govbot-dir "$DEV_DIR" fi # Run the release binary run-release: cargo run --release -# Generate locale enum from pipeline-manager config -generate: - cargo run --bin generate-locale-enum - # Update mocks by cloning/pulling repos and cleaning them up # Usage: just mocks [LOCALES...] 
# Example: just mocks usa il @@ -199,10 +194,10 @@ mocks *LOCALES: # Ensure binary is built cargo build --bin govbot - # Clone/pull repositories using govbot + # Pull datasets using govbot echo "" - echo "📥 Cloning/pulling repositories..." - cargo run --bin govbot -- clone --govbot-dir "$MOCKS_DIR" $LOCALES + echo "📥 Pulling datasets..." + cargo run --bin govbot -- pull --govbot-dir "$MOCKS_DIR" $LOCALES # Cleanup functions function delete_files_dir { diff --git a/actions/govbot/src/bin/generate-locale-enum.rs b/actions/govbot/src/bin/generate-locale-enum.rs deleted file mode 100644 index 6573339a..00000000 --- a/actions/govbot/src/bin/generate-locale-enum.rs +++ /dev/null @@ -1,204 +0,0 @@ -//! Generate a Rust enum from working locales in pipeline-manager config.yml -//! Run with: cargo run --bin generate-locale-enum - -use serde::Deserialize; -use std::collections::HashMap; -use std::fs; -use std::path::PathBuf; - -#[derive(Debug, Deserialize)] -struct Config { - locales: HashMap, -} - -#[derive(Debug, Deserialize)] -struct LocaleConfig { - #[allow(dead_code)] - template: String, - #[serde(default)] - labels: Vec, -} - -fn locale_to_variant(locale: &str) -> String { - // Two-letter codes should be uppercase (e.g., 'ar' -> 'AR', 'pr' -> 'PR') - // Longer codes should be capitalized (e.g., 'usa' -> 'Usa') - if locale.len() <= 2 { - locale.to_uppercase() - } else { - let mut chars = locale.chars(); - match chars.next() { - None => String::new(), - Some(first) => first.to_uppercase().collect::() + chars.as_str(), - } - } -} - -fn get_working_locales(config_path: &PathBuf) -> Result, Box> { - let content = fs::read_to_string(config_path)?; - let config: Config = serde_yaml::from_str(&content)?; - - let mut working_locales: Vec = config - .locales - .into_iter() - .filter(|(_, locale_config)| locale_config.labels.contains(&"working".to_string())) - .map(|(locale, _)| locale) - .collect(); - - working_locales.sort(); - Ok(working_locales) -} - -fn generate_rust_enum( - 
locales: &[String], - output_path: &PathBuf, -) -> Result<(), Box> { - // Create mapping of locale -> variant name - let locale_variants: Vec<(String, String)> = locales - .iter() - .map(|loc| (loc.clone(), locale_to_variant(loc))) - .collect(); - - // Generate enum variants - let enum_variants: Vec = locale_variants - .iter() - .map(|(_, variant)| format!(" {},", variant)) - .collect(); - - // Generate match arms for as_str - let as_str_arms: Vec = locale_variants - .iter() - .map(|(locale, variant)| { - format!(" WorkingLocale::{} => \"{}\",", variant, locale) - }) - .collect(); - - // Generate match arms for as_lowercase - let as_lowercase_arms: Vec = locale_variants - .iter() - .map(|(locale, variant)| { - format!( - " WorkingLocale::{} => \"{}\",", - variant, - locale.to_lowercase() - ) - }) - .collect(); - - // Generate match arms for From<&str> - let from_str_arms: Vec = locale_variants - .iter() - .map(|(locale, variant)| { - format!( - " \"{}\" => WorkingLocale::{},", - locale.to_lowercase(), - variant - ) - }) - .collect(); - - // Generate all() vector items - let all_items: Vec = locale_variants - .iter() - .map(|(_, variant)| format!(" WorkingLocale::{},", variant)) - .collect(); - - let rust_code = format!( - r#"//! Auto-generated locale enum from pipeline-manager config.yml -//! This file is generated by src/bin/generate-locale-enum.rs -//! 
Do not edit manually - regenerate using: just generate - -/// Locale codes for working pipelines -#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, serde::Serialize, serde::Deserialize)] -#[serde(rename_all = "lowercase")] -pub enum WorkingLocale {{ - All, -{} -}} - -impl WorkingLocale {{ - /// Get all working locales as a vector (excludes All variant) - pub fn all() -> Vec {{ - vec![ -{} - ] - }} - - /// Get the locale code as a string - pub fn as_str(&self) -> &'static str {{ - match self {{ - WorkingLocale::All => "all", -{} - }} - }} - - /// Get the locale code in lowercase - pub fn as_lowercase(&self) -> &'static str {{ - match self {{ - WorkingLocale::All => "all", -{} - }} - }} -}} - -impl From<&str> for WorkingLocale {{ - fn from(s: &str) -> Self {{ - match s.to_lowercase().as_str() {{ - "all" => WorkingLocale::All, -{} - _ => panic!("Invalid working locale: {{}}", s), - }} - }} -}} - -impl std::fmt::Display for WorkingLocale {{ - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {{ - write!(f, "{{}}", self.as_lowercase()) - }} -}} -"#, - enum_variants.join("\n"), - all_items.join("\n"), - as_str_arms.join("\n"), - as_lowercase_arms.join("\n"), - from_str_arms.join("\n") - ); - - fs::write(output_path, rust_code)?; - Ok(()) -} - -fn main() -> Result<(), Box> { - // Get paths relative to the binary location - // manifest_dir is /Users/sartaj/Git/toolkit/actions/govbot - // config is at /Users/sartaj/Git/toolkit/actions/pipeline-manager/chn-openstates-files.yml - let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); - let config_path = manifest_dir - .parent() - .unwrap() // /Users/sartaj/Git/toolkit/actions - .join("pipeline-manager") - .join("chn-openstates-files.yml"); - let output_path = manifest_dir.join("src").join("locale_generated.rs"); - - if !config_path.exists() { - eprintln!( - "❌ Error: Config file not found at {}", - config_path.display() - ); - std::process::exit(1); - } - - let locales = 
get_working_locales(&config_path)?; - if locales.is_empty() { - eprintln!("⚠️ Warning: No working locales found in config"); - } - - generate_rust_enum(&locales, &output_path)?; - println!( - "✅ Generated {} with {} working locales", - output_path.display(), - locales.len() - ); - println!("📋 Working locales: {}", locales.join(", ")); - - Ok(()) -} diff --git a/actions/govbot/src/bluesky.rs b/actions/govbot/src/bluesky.rs new file mode 100644 index 00000000..376918b5 --- /dev/null +++ b/actions/govbot/src/bluesky.rs @@ -0,0 +1,1436 @@ +//! The `bluesky` publisher — posts matched bills to a Bluesky account. +//! +//! This is a **posting bot**, not a hosted AT-Protocol feed-generator service: +//! it posts to a normal Bluesky account via the XRPC API and runs to +//! completion (from cron/CI), so it needs no always-on server. +//! +//! Flow: +//! 1. Authenticate — `com.atproto.server.createSession` with the account +//! handle + an **app password** read from the environment. +//! 2. Select records — keep those carrying a `select`ed tag whose calibrated +//! `final_score` clears `min_score`. +//! 3. For each record not already in the ledger, render a post (<=300 +//! chars) and `com.atproto.repo.createRecord` an `app.bsky.feed.post`. +//! 4. Append the record's id to the posted-state ledger so re-runs never +//! double-post. +//! +//! `--dry-run` renders the posts that *would* be sent and touches no network +//! and no ledger. +//! +//! Credentials are **environment-only** — never read from `govbot.yml`: +//! - `BLUESKY_HANDLE` — the account handle, e.g. `mybot.bsky.social` +//! - `BLUESKY_APP_PASSWORD` — an app password (Settings → App Passwords), +//! never the main account password +//! - `BLUESKY_SERVICE` — optional PDS base URL (default `https://bsky.social`) +//! +//! ### `{link}` resolution +//! +//! `{link}` in `post_template` resolves with this priority: +//! 1. the manifest's companion `html` publisher's `base_url` — the +//! 
human-readable landing page activists actually want to click through +//! to (computed once in `run_publish_command` and passed in via +//! `PublishJob::html_entry_url`); +//! 2. the bluesky publisher's own `base_url` joined to the bill's dataset +//! `sources.bill` path — the historical default, which points at the +//! raw `metadata.json` file (rarely what an activist wants); +//! 3. the bill's `bill.sources[0].url` (the upstream legislature page). +//! +//! Declaring an `html` publisher alongside `bluesky` is what makes the +//! default useful. See AGENT.md §2.2. + +use crate::publish::PublishJob; +use anyhow::{Context, Result}; +use serde_json::{json, Value}; +use std::collections::{HashMap, HashSet}; +use std::fs; +use std::io::Write; +use std::path::{Path, PathBuf}; + +/// Bluesky's hard post-text limit (graphemes; we approximate with chars). +const POST_TEXT_LIMIT: usize = 300; + +/// Default PDS service endpoint when `BLUESKY_SERVICE` is unset. +const DEFAULT_SERVICE: &str = "https://bsky.social"; + +/// The default post-text template. Kept deliberately simple — a future +/// `summarize` transform will improve framing. +const DEFAULT_TEMPLATE: &str = "{title}\n\n{tags} · {link}"; + +/// A post ready to be sent: the routing key (ledger id) plus rendered text. +#[derive(Debug)] +struct RenderedPost { + /// The ledger key — a stable per-record id (the entry GUID). + id: String, + /// The post body, already truncated to the Bluesky limit. + text: String, +} + +/// Run the `bluesky` publisher against its result stream. +/// +/// `dry_run` renders the would-be posts and touches no network and no ledger. +pub fn run_bluesky(job: &PublishJob, dry_run: bool) -> Result<()> { + let p = job.publisher; + let select = p.select.clone().unwrap_or_default(); + let min_score = p.resolved_min_score(); + + // Resolve the ledger path (project-dir relative). Default: a per-publisher + // file under `state/`. 
The legacy `.govbot/`-rooted path is consulted as + // a read-only fallback for projects that ran a pre-fix govbot, so a + // version bump doesn't lose post history; see `resolve_ledger_path`. + let ledger_path = resolve_ledger_path(job); + let legacy_path = legacy_ledger_path(job); + + // Dedup-then-filter, by **bill** (jurisdiction, bill_id), not by + // action-log. A bill emits one record per action-log file (committee + // referral, hearing, passage vote …); without this collapse, an + // activist sees the same bill posted N times in a row (NV AB1 6×, + // AK HB53 4× on the climate-tracker feed before the fix). The rule: + // + // 1. group every entry by `bill_guid` (no score filter yet); + // 2. within each group pick the **highest-scoring qualifying log** + // as the representative — so a bill counts when *any* of its + // logs cleared `min_score` for a selected tag, and the post we + // render is the strongest one; + // 3. drop bills whose every log scored under threshold. + // + // `{link}` resolves with this priority (see `resolve_link`): + // 1. the companion `html` publisher's landing-page URL (the human page); + // 2. the bluesky publisher's own `base_url` joined to the bill source + // path (the historical default: `metadata.json`, the JSON file); + // 3. the bill's `bill.sources[0].url` (the upstream legislature page). + // Most useful default with no new manifest surface: when the manifest + // carries an html publisher, route activists to that human page rather + // than to the raw JSON that the rss/html publishers' `extract_link` + // emits. 
+ let representatives = pick_per_bill_representatives(&job.entries, &select, min_score); + let posts: Vec<RenderedPost> = representatives + .into_iter() + .map(|e| { + render_post( + e, + p.post_template.as_deref(), + p.base_url.as_deref(), + job.html_entry_url.as_deref(), + ) + }) + .collect(); + + if posts.is_empty() { + eprintln!( + "Publisher '{}' (bluesky): no records cleared min_score {} for tags {} — nothing to post.", + job.name, + min_score, + if select.is_empty() { "".to_string() } else { select.join(", ") } + ); + return Ok(()); + } + + // Idempotency: drop records already in the posted-state ledger. The set + // is the union of the new (`state/`) ledger and the legacy (`.govbot/`) + // ledger so an upgrading project doesn't double-post records it logged + // under the old path. Writes only land at the new path; the legacy file + // becomes harmless once a full re-run has copied its contents forward. + // + // Both shapes of ledger entry are honoured: + // - **New** (bill-level GUID, `/.../bills/`) — matched + // verbatim against the post's bill-level id. + // - **Legacy** (per-log GUID) — collapsed via + // `ledger_id_to_bill_key` on read. Per-bill-log layout entries + // (`/.../bills//logs/`) suppress re-posts + // cleanly. Session-level-log layout entries + // (`/.../sessions//logs/`) — the OCD-files + // common case — strip to the session prefix and incur a + // documented one-time re-post per previously-posted bill (after + // which the new bill-level GUID is in the ledger). See + // `ledger_id_to_bill_key` for the migration story. + let mut already_posted: HashSet<String> = HashSet::new(); + for id in read_ledger(&ledger_path)? { + already_posted.insert(ledger_id_to_bill_key(&id)); + } + if ledger_path != legacy_path { + for id in read_ledger(&legacy_path)? 
{ + already_posted.insert(ledger_id_to_bill_key(&id)); + } + } + let pending: Vec<&RenderedPost> = posts + .iter() + .filter(|post| !already_posted.contains(&post.id)) + .collect(); + + if dry_run { + eprintln!( + "Publisher '{}' (bluesky) --dry-run: {} record(s) cleared the threshold, \ + {} already posted, {} would be posted. No network, no ledger writes.", + job.name, + posts.len(), + posts.len() - pending.len(), + pending.len(), + ); + for (i, post) in pending.iter().enumerate() { + println!( + "--- post {} of {} (id: {}) ---", + i + 1, + pending.len(), + post.id + ); + println!("{}", post.text); + println!(); + } + return Ok(()); + } + + if pending.is_empty() { + eprintln!( + "Publisher '{}' (bluesky): all {} matching record(s) already posted — nothing new.", + job.name, + posts.len() + ); + return Ok(()); + } + + // Authenticate — credentials are environment-only. If they are absent, + // skip the publisher with a WARN rather than failing the whole pipeline: + // first-time activists running `govbot run` without Bluesky creds yet + // should still get their RSS / HTML feeds rather than a red error. + // Pair this with `govbot run --dry-run` to render-only without + // requiring creds at all. + if !creds_present() { + eprintln!( + "⚠️ Publisher '{}' (bluesky): BLUESKY_HANDLE / BLUESKY_APP_PASSWORD \ + not set — skipping. 
Set them (an app password from Bluesky \ + Settings → App Passwords) to go live; or use `govbot run \ + --dry-run` / `govbot publish --publisher {} --dry-run` to \ + render-only.", + job.name, job.name + ); + return Ok(()); + } + let service = std::env::var("BLUESKY_SERVICE") + .ok() + .filter(|s| !s.trim().is_empty()) + .unwrap_or_else(|| DEFAULT_SERVICE.to_string()); + let session = create_session(&service).context("Bluesky authentication failed")?; + + eprintln!( + "Publisher '{}' (bluesky): authenticated as {} — posting {} record(s) to {}.", + job.name, + session.handle, + pending.len(), + service + ); + + // Post each pending record, appending to the ledger as we go so a + // mid-run failure never re-posts what already succeeded. + let mut posted = 0usize; + for post in &pending { + match create_post(&service, &session, &post.text) { + Ok(uri) => { + append_ledger(&ledger_path, &post.id)?; + posted += 1; + eprintln!(" ✓ posted {} -> {}", post.id, uri); + } + Err(e) => { + // Fail loudly but stop — leave the rest for the next run + // rather than hammering a failing endpoint. + anyhow::bail!( + "Publisher '{}' (bluesky): posted {}/{} record(s); failed on {}: {}", + job.name, + posted, + pending.len(), + post.id, + e + ); + } + } + } + + eprintln!( + "✓ Publisher '{}' (bluesky): posted {} record(s); ledger at {}", + job.name, + posted, + ledger_path.display() + ); + Ok(()) +} + +// ============================================================ +// Record selection + post rendering +// ============================================================ + +/// True when the record carries a `select`ed tag whose calibrated +/// `final_score` clears `min_score`. When `select` is empty, any tag counts. +/// +/// The `tags` field is a map `tag_name -> ScoreBreakdown`; the calibrated +/// probability is `tags..final_score` (STREAM_PROTOCOL §5). 
+/// +/// **Note.** The production publisher path now uses +/// [`pick_per_bill_representatives`], which folds this check into a +/// per-group walk so the per-bill dedup can pick the highest-scoring +/// qualifying log as the representative. This standalone predicate is +/// kept as the simplest unit-testable surface for the threshold rule. +#[cfg_attr(not(test), allow(dead_code))] +fn record_clears_threshold(entry: &Value, select: &[String], min_score: f64) -> bool { + let tags = match entry.get("tags").and_then(|t| t.as_object()) { + Some(t) if !t.is_empty() => t, + _ => return false, + }; + tags.iter().any(|(name, score)| { + let selected = select.is_empty() || select.iter().any(|s| s == name); + if !selected { + return false; + } + score + .get("final_score") + .and_then(|v| v.as_f64()) + .map(|s| s >= min_score) + .unwrap_or(false) + }) +} + +/// Render a record into post text, applying the template and truncating to +/// the Bluesky character limit. +/// +/// `{link}` resolution order (see `resolve_link`): +/// 1. `html_entry_url` — the manifest's companion `html` publisher's +/// landing-page URL (the human-readable index activists actually want +/// to click through to); +/// 2. `base_url` joined to the bill's `sources.bill` dataset path +/// (the historical default: a raw `metadata.json` link); +/// 3. the bill's `bill.sources[0].url` (the upstream legislature page); +/// 4. empty. +/// +/// The html-publisher route is the *useful default* — without it, `{link}` +/// resolves to `//.../metadata.json`, which lands an activist's reader +/// on a JSON file. See Bug 7. +fn render_post( + entry: &Value, + template: Option<&str>, + base_url: Option<&str>, + html_entry_url: Option<&str>, +) -> RenderedPost { + // Ledger key — **bill-level** so future action logs for the same bill + // (new committee referrals, vote events, …) do not re-post the bill. 
+ // Pre-fix this was the per-log GUID, which let a single bill trigger + // N posts as N action logs arrived; the migration story for already- + // posted bills is in `ledger_id_to_bill_key`. + let id = crate::rss::bill_guid(entry); + let template = template.unwrap_or(DEFAULT_TEMPLATE); + + let title = bill_title(entry); + let tags = entry + .get("tags") + .and_then(|t| t.as_object()) + .map(|m| m.keys().cloned().collect::<Vec<_>>().join(", ")) + .unwrap_or_default(); + let link = resolve_link(entry, base_url, html_entry_url).unwrap_or_default(); + let identifier = entry + .get("bill") + .and_then(|b| b.get("identifier")) + .and_then(|v| v.as_str()) + .or_else(|| entry.get("id").and_then(|v| v.as_str())) + .unwrap_or("") + .to_string(); + let session = entry + .get("bill") + .and_then(|b| b.get("legislative_session")) + .and_then(|v| v.as_str()) + .unwrap_or("") + .to_string(); + let score = top_score(entry) + .map(|s| format!("{:.2}", s)) + .unwrap_or_default(); + + let text = template + .replace("{title}", &title) + .replace("{tags}", &tags) + .replace("{link}", &link) + .replace("{identifier}", &identifier) + .replace("{session}", &session) + .replace("{score}", &score); + + RenderedPost { + id, + text: truncate_post(&text), + } +} + +/// Resolve `{link}` for a bluesky post. +/// +/// Priority: +/// 1. the companion `html` publisher's landing-page URL — the +/// human-readable index page the manifest already promised activists +/// (the fix for Bug 7); +/// 2. the historical default — `extract_link`: bluesky's own `base_url` +/// joined to the dataset `sources.bill` path, falling back to the +/// bill's first upstream source URL. +/// +/// (1) is the useful default: without it, `{link}` pointed at the raw +/// `metadata.json` path under the bluesky `base_url`, which sent an +/// activist's reader to a JSON file. The html publisher's landing page is +/// the human page an activist actually wants to click. 
+fn resolve_link( + entry: &Value, + base_url: Option<&str>, + html_entry_url: Option<&str>, +) -> Option<String> { + if let Some(url) = html_entry_url { + let trimmed = url.trim(); + if !trimmed.is_empty() { + return Some(trimmed.trim_end_matches('/').to_string()); + } + } + crate::rss::extract_link(entry, base_url) +} + +/// The highest calibrated `final_score` across a record's tags. +fn top_score(entry: &Value) -> Option<f64> { + entry + .get("tags") + .and_then(|t| t.as_object()) + .and_then(|tags| { + tags.values() + .filter_map(|s| s.get("final_score").and_then(|v| v.as_f64())) + .fold(None, |acc, s| Some(acc.map_or(s, |a: f64| a.max(s)))) + }) +} + +/// The highest calibrated `final_score` across a record's tags **restricted +/// to `select`**. When `select` is empty, every tag counts; otherwise only +/// the named tags. Returns `None` when no qualifying tag carries a score. +/// +/// This is the score used to rank logs *within a bill group* when picking +/// the representative — so a bill posts under its strongest qualifying log, +/// not (arbitrarily) under its newest one. +fn top_selected_score(entry: &Value, select: &[String]) -> Option<f64> { + entry + .get("tags") + .and_then(|t| t.as_object()) + .and_then(|tags| { + tags.iter() + .filter(|(name, _)| select.is_empty() || select.iter().any(|s| s == *name)) + .filter_map(|(_, s)| s.get("final_score").and_then(|v| v.as_f64())) + .fold(None, |acc, s| Some(acc.map_or(s, |a: f64| a.max(s)))) + }) +} + +/// Collapse an entry stream to one representative per (jurisdiction, +/// bill_id), filtering and ranking by score. +/// +/// For each bill the bluesky publisher's contract is **one post**. Inputs +/// may carry many entries for the same bill (one per action log); this +/// function: +/// +/// 1. groups by [`crate::rss::bill_guid`]; +/// 2. within each group, **keeps only logs whose top `select`ed score +/// clears `min_score`** — the bill is dropped when no log qualifies; +/// 3. 
picks the **highest-scoring** qualifying log as the representative +/// — ties break on stream order (the input is timestamp-sorted DESC, +/// so a tie wins for the newest log). +/// +/// Returns the representatives in **input stream order** so a downstream +/// `--limit` keeps the bills the user saw first (the newest, given the +/// upstream DESC sort). +fn pick_per_bill_representatives<'a>( + entries: &'a [Value], + select: &[String], + min_score: f64, +) -> Vec<&'a Value> { + // Map bill_guid -> index into `entries` of the current best representative, + // along with its score. A `Vec` of bill_guid in first-seen order gives + // us a deterministic output order. + let mut best: HashMap<String, (usize, f64)> = HashMap::new(); + let mut order: Vec<String> = Vec::new(); + + for (i, e) in entries.iter().enumerate() { + // Bill counts when *any* of its logs clears the threshold for a + // selected tag — this filter is per-log, applied during the group + // walk so a bill with zero qualifying logs simply never enters the + // map. + let Some(score) = top_selected_score(e, select) else { + continue; + }; + if score < min_score { + continue; + } + let key = crate::rss::bill_guid(e); + match best.get(&key) { + Some((_, prev_score)) if *prev_score >= score => { + // The current best beats (or ties) this log on score — keep + // the existing winner (preserves stream order on ties, which + // means newest wins since input is DESC). + } + Some(_) => { + best.insert(key, (i, score)); + } + None => { + order.push(key.clone()); + best.insert(key, (i, score)); + } + } + } + + order + .into_iter() + .filter_map(|k| best.get(&k).map(|(i, _)| &entries[*i])) + .collect() +} + +/// Collapse a (possibly per-log, legacy-shape) ledger id to its bill-level +/// key — the new ledger key shape. +/// +/// Pre-fix the ledger held per-log GUIDs of the form +/// `/.../sessions//logs/.json` (session-level layout, +/// the OCD-files common case) or +/// `/.../bills//logs/.json` (per-bill-log layout).
+/// Post-fix the writer emits the **bill-level** key — always +/// `/.../bills/` — and the reader compares against that. +/// +/// This function strips `/logs/...` off either shape. Two outcomes: +/// +/// - **Per-bill-log layout** — the prefix already ends in +/// `/bills/`, so the collapse cleanly matches the new +/// bill-level key. Legacy entries from this layout suppress re-posts. +/// - **Session-level-log layout** — the prefix ends at `/sessions/` +/// with no bill segment. The legacy entry preserves a session-prefix +/// in the dedup set, but a new post's bill-level key +/// (`/bills/`) won't match it. The bill therefore +/// **re-posts once** on the first post-upgrade run, after which the +/// new bill-level GUID is in the ledger and never re-posts again. +/// +/// Pre-fix users incur at most one extra post per previously-posted bill +/// in session-level-log layouts. This is the honest migration cost; the +/// alternative — guessing the bill from a session-level log path alone — +/// would be wrong as often as it would be right (the filename does not +/// reliably encode the bill). +/// +/// Entries that do not contain `/logs/` (already bill-level, or a +/// synthetic `_` fallback) pass through unchanged. +fn ledger_id_to_bill_key(id: &str) -> String { + match id.split_once("/logs/") { + Some((prefix, _)) => prefix.to_string(), + None => id.to_string(), + } +} + +/// Best-effort bill title — the bill's `title`, else its identifier, else a +/// generic fallback. 
+fn bill_title(entry: &Value) -> String { + if let Some(t) = entry + .get("bill") + .and_then(|b| b.get("title")) + .and_then(|v| v.as_str()) + { + let t = t.trim(); + if !t.is_empty() { + return t.to_string(); + } + } + if let Some(id) = entry + .get("bill") + .and_then(|b| b.get("identifier")) + .and_then(|v| v.as_str()) + .or_else(|| entry.get("id").and_then(|v| v.as_str())) + { + if !id.is_empty() { + return id.to_string(); + } + } + "Legislative update".to_string() +} + +/// Truncate post text to the Bluesky limit, appending an ellipsis when cut. +fn truncate_post(text: &str) -> String { + let trimmed = text.trim(); + if trimmed.chars().count() <= POST_TEXT_LIMIT { + return trimmed.to_string(); + } + let mut out: String = trimmed.chars().take(POST_TEXT_LIMIT - 1).collect(); + // Avoid cutting mid-word where reasonable. + if let Some(idx) = out.rfind(char::is_whitespace) { + if idx > POST_TEXT_LIMIT / 2 { + out.truncate(idx); + } + } + format!("{}…", out.trim_end()) +} + +// ============================================================ +// Posted-state ledger (idempotency) +// ============================================================ + +/// Resolve the ledger file path: the publisher's `ledger` field if set, +/// else `/state/bluesky-.ledger`. Relative paths resolve +/// against the project directory (where `govbot.yml` lives). +/// +/// **Why `state/` and not `.govbot/`.** `.govbot/` is the tool's cache — +/// the `node_modules/` equivalent — and is safe to `rm -rf` to start +/// fresh. The posted-state ledger is the opposite: it is the +/// **single source of truth** for which records the bot has already +/// posted; deleting it makes the next run double-post everything. Putting +/// it under `.govbot/` invited exactly that footgun. 
`state/` is the +/// peer of `tags/` (classification output) and `dist/` (publisher +/// output) — an operational, non-cache dir that scales as more stateful +/// publishers land (a future `mastodon` publisher would put its ledger +/// at `state/mastodon-.ledger`). +/// +/// **Backward compatibility.** Writes always land at the new +/// `state/...` path. Reads check there first; if the file is missing, +/// they fall back to the legacy `.govbot/bluesky-.ledger` so +/// existing projects don't lose post history on upgrade. After one full +/// re-run the new ledger has everything the old one did, and the user +/// (or a future `govbot migrate`) can delete the legacy file. See +/// `read_ledger` / `legacy_ledger_path`. +fn resolve_ledger_path(job: &PublishJob) -> PathBuf { + match &job.publisher.ledger { + Some(p) => { + let p = PathBuf::from(p); + if p.is_absolute() { + p + } else { + job.project_dir.join(p) + } + } + None => job + .project_dir + .join("state") + .join(format!("bluesky-{}.ledger", job.name)), + } +} + +/// The legacy `.govbot/`-rooted ledger path. Read-only fallback for +/// projects that ran a pre-fix govbot; never written. See the doc +/// comment on `resolve_ledger_path` for the migration story. +fn legacy_ledger_path(job: &PublishJob) -> PathBuf { + job.project_dir + .join(".govbot") + .join(format!("bluesky-{}.ledger", job.name)) +} + +/// Read the set of already-posted record ids from the ledger. A missing +/// ledger is an empty set (first run). The ledger is append-only, +/// newline-delimited, one record id per line. 
+fn read_ledger(path: &Path) -> Result<HashSet<String>> { + if !path.exists() { + return Ok(HashSet::new()); + } + let contents = fs::read_to_string(path) + .with_context(|| format!("Failed to read posted-state ledger: {}", path.display()))?; + Ok(contents + .lines() + .map(str::trim) + .filter(|l| !l.is_empty()) + .map(|l| l.to_string()) + .collect()) +} + +/// Append a posted record id to the ledger, creating it (and its parent +/// directory) if needed. +fn append_ledger(path: &Path, id: &str) -> Result<()> { + if let Some(parent) = path.parent() { + fs::create_dir_all(parent) + .with_context(|| format!("Failed to create ledger directory: {}", parent.display()))?; + } + let mut file = fs::OpenOptions::new() + .create(true) + .append(true) + .open(path) + .with_context(|| format!("Failed to open posted-state ledger: {}", path.display()))?; + writeln!(file, "{}", id) + .with_context(|| format!("Failed to append to ledger: {}", path.display()))?; + Ok(()) +} + +// ============================================================ +// AT Protocol XRPC +// ============================================================ + +/// An authenticated Bluesky session. +struct Session { + /// The bearer access token (`accessJwt`). + access_jwt: String, + /// The repo DID — the record owner for `createRecord`. + did: String, + /// The resolved account handle (for logging). + handle: String, +} + +/// Authenticate via `com.atproto.server.createSession`. +/// +/// Reads `BLUESKY_HANDLE` + `BLUESKY_APP_PASSWORD` from the environment; +/// these are required and never sourced from `govbot.yml`. +fn create_session(service: &str) -> Result<Session> { + let handle = require_env("BLUESKY_HANDLE")?; + let password = require_env("BLUESKY_APP_PASSWORD")?; + + let url = format!( + "{}/xrpc/com.atproto.server.createSession", + service.trim_end_matches('/') + ); + // `http_status_as_error(false)` keeps a non-2xx response an `Ok` so we can + // read its body for an actionable error; only transport errors are `Err`.
+ let response = ureq::post(&url) + .config() + .http_status_as_error(false) + .build() + .header("Content-Type", "application/json") + .send_json(json!({ "identifier": handle, "password": password })) + .context("createSession request failed")?; + + let status = response.status(); + let mut resp_body = response.into_body(); + if !status.is_success() { + let detail = resp_body + .read_to_string() + .unwrap_or_else(|_| "".to_string()); + anyhow::bail!( + "createSession returned HTTP {} — check BLUESKY_HANDLE / \ + BLUESKY_APP_PASSWORD (use an app password, not the main \ + password). Response: {}", + status.as_u16(), + detail + ); + } + let body: Value = resp_body + .read_json() + .context("Failed to parse createSession response")?; + + let access_jwt = body + .get("accessJwt") + .and_then(|v| v.as_str()) + .context("createSession response missing accessJwt")? + .to_string(); + let did = body + .get("did") + .and_then(|v| v.as_str()) + .context("createSession response missing did")? + .to_string(); + let handle = body + .get("handle") + .and_then(|v| v.as_str()) + .unwrap_or(&handle) + .to_string(); + + Ok(Session { + access_jwt, + did, + handle, + }) +} + +/// Post one `app.bsky.feed.post` record via `com.atproto.repo.createRecord`. +/// Returns the AT URI of the created record. +fn create_post(service: &str, session: &Session, text: &str) -> Result<String> { + let url = format!( + "{}/xrpc/com.atproto.repo.createRecord", + service.trim_end_matches('/') + ); + // RFC-3339 UTC timestamp, as the AT Protocol expects for `createdAt`.
+ let created_at = chrono::Utc::now() + .format("%Y-%m-%dT%H:%M:%S%.3fZ") + .to_string(); + + let response = ureq::post(&url) + .config() + .http_status_as_error(false) + .build() + .header("Authorization", &format!("Bearer {}", session.access_jwt)) + .header("Content-Type", "application/json") + .send_json(json!({ + "repo": session.did, + "collection": "app.bsky.feed.post", + "record": { + "$type": "app.bsky.feed.post", + "text": text, + "createdAt": created_at, + } + })) + .context("createRecord request failed")?; + + let status = response.status(); + let mut resp_body = response.into_body(); + if !status.is_success() { + let detail = resp_body + .read_to_string() + .unwrap_or_else(|_| "".to_string()); + anyhow::bail!("createRecord returned HTTP {}: {}", status.as_u16(), detail); + } + let body: Value = resp_body + .read_json() + .context("Failed to parse createRecord response")?; + + Ok(body + .get("uri") + .and_then(|v| v.as_str()) + .unwrap_or("") + .to_string()) +} + +/// True when both required Bluesky credential env vars are set and non-empty. +/// Used to decide whether the publisher should skip-with-WARN (missing creds) +/// or attempt the live authentication flow. +fn creds_present() -> bool { + env_nonempty("BLUESKY_HANDLE") && env_nonempty("BLUESKY_APP_PASSWORD") +} + +/// True when `key` is set to a non-empty (and non-whitespace) value. +fn env_nonempty(key: &str) -> bool { + std::env::var(key) + .ok() + .map(|v| !v.trim().is_empty()) + .unwrap_or(false) +} + +/// Read a required environment variable, with an actionable error message. +fn require_env(key: &str) -> Result<String> { + std::env::var(key) + .ok() + .filter(|v| !v.trim().is_empty()) + .with_context(|| { + format!( + "the `bluesky` publisher needs the {key} environment variable. \ + Set BLUESKY_HANDLE and BLUESKY_APP_PASSWORD (an app password \ + from Bluesky Settings → App Passwords). Never put credentials \ + in govbot.yml."
+ ) + }) +} + +#[cfg(test)] +mod tests { + use super::*; + use serde_json::json; + + #[test] + fn truncate_respects_limit() { + let long = "word ".repeat(100); + let out = truncate_post(&long); + assert!(out.chars().count() <= POST_TEXT_LIMIT); + assert!(out.ends_with('…')); + } + + #[test] + fn truncate_leaves_short_text_alone() { + assert_eq!(truncate_post(" hello "), "hello"); + } + + #[test] + fn threshold_selects_on_calibrated_score() { + let entry = json!({ + "tags": { "clean_energy": { "final_score": 0.8 } } + }); + assert!(record_clears_threshold(&entry, &[], 0.6)); + assert!(record_clears_threshold( + &entry, + &["clean_energy".to_string()], + 0.6 + )); + assert!(!record_clears_threshold(&entry, &[], 0.9)); + assert!(!record_clears_threshold( + &entry, + &["fossil_fuels".to_string()], + 0.6 + )); + } + + #[test] + fn threshold_rejects_untagged() { + assert!(!record_clears_threshold(&json!({}), &[], 0.0)); + assert!(!record_clears_threshold(&json!({ "tags": {} }), &[], 0.0)); + } + + /// When BLUESKY_HANDLE / BLUESKY_APP_PASSWORD are absent, `creds_present` + /// reports `false` — the signal that lets `run_bluesky` skip with a WARN + /// instead of failing the whole pipeline. With both set non-empty, + /// `true`. + /// + /// This test mutates process env; `cargo test` runs threads in parallel by + /// default, so it locks a mutex around the env touch. + #[test] + fn creds_present_reflects_env() { + // Serialise env mutation across the env-touching tests so parallel + // test threads can't see each other's writes mid-check. + use std::sync::Mutex; + static ENV_LOCK: Mutex<()> = Mutex::new(()); + let _g = ENV_LOCK.lock().unwrap(); + + // Snapshot original values to restore at the end. 
+ let prev_h = std::env::var("BLUESKY_HANDLE").ok(); + let prev_p = std::env::var("BLUESKY_APP_PASSWORD").ok(); + + std::env::remove_var("BLUESKY_HANDLE"); + std::env::remove_var("BLUESKY_APP_PASSWORD"); + assert!(!creds_present()); + + std::env::set_var("BLUESKY_HANDLE", "x.bsky.social"); + assert!(!creds_present()); // password still missing + + std::env::set_var("BLUESKY_APP_PASSWORD", "abcd-efgh-ijkl-mnop"); + assert!(creds_present()); + + std::env::set_var("BLUESKY_HANDLE", " "); // whitespace-only + assert!(!creds_present()); + + // Restore. + match prev_h { + Some(v) => std::env::set_var("BLUESKY_HANDLE", v), + None => std::env::remove_var("BLUESKY_HANDLE"), + } + match prev_p { + Some(v) => std::env::set_var("BLUESKY_APP_PASSWORD", v), + None => std::env::remove_var("BLUESKY_APP_PASSWORD"), + } + } + + #[test] + fn render_substitutes_template_placeholders() { + let entry = json!({ + "id": "wy-legislation/.../HB0001", + "bill": { "title": "Renewable energy storage act", "identifier": "HB 1" }, + "tags": { "clean_energy": { "final_score": 0.92 } } + }); + let post = render_post( + &entry, + Some("{title} [{identifier}] {tags} {score}"), + None, + None, + ); + assert!(post.text.contains("Renewable energy storage act")); + assert!(post.text.contains("[HB 1]")); + assert!(post.text.contains("clean_energy")); + assert!(post.text.contains("0.92")); + } + + /// `{link}` renders the publisher's `base_url` joined to the bill's + /// source-log path — same shape as the rss/html publishers. Before the + /// fix, bluesky passed `None` and `{link}` rendered empty. 
+ #[test] + fn render_link_uses_publisher_base_url() { + let entry = json!({ + "id": "wy-legislation/.../HB0001", + "bill": { "title": "Wind energy permitting act", "identifier": "HB 1" }, + "sources": { "bill": "wy-legislation/.../HB0001/metadata.json" }, + "tags": { "clean_energy": { "final_score": 0.91 } } + }); + let post = render_post( + &entry, + Some("{title} {link}"), + Some("https://example.org/climate-tracker"), + None, // no companion html publisher + ); + assert!( + post.text.contains( + "https://example.org/climate-tracker/wy-legislation/.../HB0001/metadata.json" + ), + "expected base_url to be prepended to source path; got: {}", + post.text + ); + } + + /// Without a configured `base_url`, `{link}` falls back to the bill's + /// `bill.sources[0].url` (when present) — preserves the historical + /// shape and gives manifest authors a sensible default before they pick + /// a base_url. + #[test] + fn render_link_falls_back_to_bill_source_url() { + let entry = json!({ + "id": "wy-legislation/.../HB0001", + "bill": { + "title": "Solar tax-credit act", + "identifier": "HB 1", + "sources": [{ "url": "https://wyoleg.gov/2025/Bills/HB0001" }] + }, + "tags": { "clean_energy": { "final_score": 0.9 } } + }); + let post = render_post(&entry, Some("{title} -> {link}"), None, None); + assert!( + post.text.contains("https://wyoleg.gov/2025/Bills/HB0001"), + "expected bill.sources[0].url to render as {{link}}; got: {}", + post.text + ); + } + + /// Bug 7 regression: when the manifest has a companion `html` publisher, + /// `{link}` resolves to that publisher's landing-page URL — not to the + /// raw `metadata.json` path under bluesky's own `base_url`. + /// + /// Before this fix, with bluesky `base_url: + /// https://example.org/climate-tracker` set, a userland dry-run rendered: + /// https://example.org/climate-tracker/wy-legislation/.../HB9999/metadata.json + /// which is a JSON file, not a human page. 
+ #[test] + fn render_link_prefers_html_publisher_landing_page() { + let entry = json!({ + "id": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB9999", + "bill": { "title": "Clean energy tax credit", "identifier": "HB9999" }, + "sources": { + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB9999/metadata.json" + }, + "tags": { "clean_energy": { "final_score": 0.91 } } + }); + let post = render_post( + &entry, + Some("{title} -> {link}"), + Some("https://example.org/climate-tracker"), // bluesky's own base_url + Some("https://example.org/climate-tracker"), // companion html publisher's base_url + ); + // Must NOT route activists at the raw JSON path. + assert!( + !post.text.contains("metadata.json"), + "expected {{link}} to skip the metadata.json path when a companion html publisher exists; got: {}", + post.text + ); + // Must land at the html publisher's URL — the human-readable index. + assert!( + post.text.contains("https://example.org/climate-tracker"), + "expected {{link}} to resolve to the html publisher's landing-page URL; got: {}", + post.text + ); + } + + // ------------------------------------------------------------ + // Ledger-path regression tests (Bug: ledger in `.govbot/`) + // ------------------------------------------------------------ + + use crate::config::{Publisher, PublisherKind}; + use tempfile::tempdir; + + /// Build a minimal bluesky `Publisher` with `ledger = None` so the + /// default-path resolution is exercised. 
+ fn bluesky_publisher_default() -> Publisher { + Publisher { + kind: PublisherKind::Bluesky, + select: None, + base_url: None, + output_dir: None, + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + } + } + + fn job_for_publisher<'a>( + name: &'a str, + publisher: &'a Publisher, + project_dir: PathBuf, + ) -> PublishJob<'a> { + PublishJob { + name, + publisher, + entries: vec![], + output_dir_override: None, + output_file_override: None, + project_dir, + dry_run: false, + html_entry_url: None, + } + } + + /// The default ledger path lands under `state/`, NOT `.govbot/`. + /// `.govbot/` is the tool's regenerable cache (node_modules/-style); + /// the ledger is user-meaningful state — deleting `.govbot/` to + /// reset the cache must not destroy post history. + #[test] + fn default_ledger_path_lives_under_state_not_govbot_cache() { + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + let resolved = resolve_ledger_path(&job); + assert_eq!( + resolved, + dir.path().join("state").join("bluesky-bluesky.ledger"), + "default ledger must be /state/bluesky-.ledger, not under .govbot/" + ); + // Cross-check: it must NOT be under the cache dir. + assert!( + !resolved.starts_with(dir.path().join(".govbot")), + "default ledger must never resolve under .govbot/ (the cache); got: {}", + resolved.display() + ); + } + + /// An explicit `ledger:` field in `govbot.yml` is honoured verbatim + /// (relative to the project dir) — including absolute paths — so a + /// user who deliberately wants a specific location can pin it. 
+ #[test] + fn explicit_ledger_field_overrides_default() { + let dir = tempdir().unwrap(); + let mut p = bluesky_publisher_default(); + p.ledger = Some("custom/posted.ledger".to_string()); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + assert_eq!( + resolve_ledger_path(&job), + dir.path().join("custom/posted.ledger") + ); + + // Absolute paths pass through untouched. + let abs = dir.path().join("abs.ledger"); + p.ledger = Some(abs.to_string_lossy().to_string()); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + assert_eq!(resolve_ledger_path(&job), abs); + } + + /// Backward-compat: an existing pre-fix ledger at the legacy + /// `.govbot/bluesky-.ledger` path is read so upgrading users + /// don't lose their post history. `read_ledger` is the unit-level + /// surface; `run_bluesky` unions the two on read. + #[test] + fn legacy_govbot_ledger_is_readable_as_fallback() { + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + + // Seed only the legacy path; new path stays absent. + let legacy = legacy_ledger_path(&job); + std::fs::create_dir_all(legacy.parent().unwrap()).unwrap(); + std::fs::write(&legacy, "wy-legislation/.../HB9999\n").unwrap(); + + // The new path resolves under state/ and has no file yet — the + // primary read is empty, the legacy read carries the history. + let new_path = resolve_ledger_path(&job); + assert!(!new_path.exists()); + assert!(read_ledger(&new_path).unwrap().is_empty()); + + let legacy_seen = read_ledger(&legacy).unwrap(); + assert!( + legacy_seen.contains("wy-legislation/.../HB9999"), + "legacy ledger must be readable so upgrades preserve post history" + ); + } + + /// Writes always land at the *new* path even when a legacy ledger + /// exists — so the legacy file becomes harmless after one full + /// re-run and the user (or a future `govbot migrate`) can delete it. 
+ #[test] + fn appends_land_at_new_path_not_legacy() { + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + + // Pre-populate the legacy ledger to simulate an upgrading project. + let legacy = legacy_ledger_path(&job); + std::fs::create_dir_all(legacy.parent().unwrap()).unwrap(); + std::fs::write(&legacy, "old-id\n").unwrap(); + let legacy_before = std::fs::read_to_string(&legacy).unwrap(); + + // Append via the resolved (new) path — the production code path. + let new_path = resolve_ledger_path(&job); + append_ledger(&new_path, "new-id").unwrap(); + + // New path now holds the new id. + let new_contents = std::fs::read_to_string(&new_path).unwrap(); + assert!(new_contents.contains("new-id")); + // Legacy is untouched — we never write there. + let legacy_after = std::fs::read_to_string(&legacy).unwrap(); + assert_eq!( + legacy_before, legacy_after, + "writes must never land at the legacy .govbot/ ledger path" + ); + // The new path is under state/, not .govbot/. + assert!(new_path.starts_with(dir.path().join("state"))); + } + + // ------------------------------------------------------------ + // Per-bill dedup regression tests (Bug: posting once per action log) + // ------------------------------------------------------------ + + /// A synthetic log entry for a single bill — the shape `govbot source + /// --join bill,tags` emits. `score` is the calibrated `final_score` + /// for the `clean_energy` tag (the test default). 
+ fn log_entry( + dataset: &str, + session: &str, + bill_id: &str, + log_filename: &str, + score: f64, + ) -> Value { + let log_path = format!( + "{}/country:us/state:xx/sessions/{}/logs/{}", + dataset, session, log_filename + ); + json!({ + "id": bill_id, + "bill": { "title": format!("Bill {}", bill_id), "identifier": bill_id }, + "log": { "bill_id": bill_id }, + "sources": { "log": log_path }, + "tags": { "clean_energy": { "final_score": score } } + }) + } + + /// Six action-log entries for the same NV AB1 bill (the audit's worst + /// case — 6 of 96 posts were the same bill) must collapse to ONE + /// rendered post, not six. + #[test] + fn bluesky_publisher_emits_one_post_per_bill_even_with_multiple_action_logs() { + let entries: Vec<Value> = (1..=6) + .map(|i| { + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + &format!("2025111{}T080000Z.classification.referral.json", i), + 0.92, + ) + }) + .collect(); + + let reps = pick_per_bill_representatives(&entries, &[], 0.5); + assert_eq!( + reps.len(), + 1, + "6 action logs for the same bill must collapse to 1 representative; got {}", + reps.len() + ); + // The representative's bill_guid is the canonical bill path — + // independent of which log won, all six share it. + assert_eq!( + crate::rss::bill_guid(reps[0]), + "nv-legislation/country:us/state:xx/sessions/2025Special36/bills/AB1", + "the representative must carry the bill-level guid, not a log-level one" + ); + } + + /// When multiple logs for the same bill score above the threshold, the + /// **highest-scoring** log becomes the representative — not the first + /// or newest. The post's text comes from that representative.
+ #[test] + fn bluesky_publisher_picks_the_highest_scoring_log_when_multiple_score() { + let entries = vec![ + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.weak.json", + 0.55, + ), + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251112T080000Z.strong.json", + 0.95, // highest + ), + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251113T080000Z.mid.json", + 0.70, + ), + ]; + + let reps = pick_per_bill_representatives(&entries, &[], 0.5); + assert_eq!(reps.len(), 1, "must collapse to 1 representative"); + // The picked rep must be the 0.95-scoring log (the "strong" one), + // which the test labels into the log filename so we can read it + // straight off `sources.log`. + let log_path = reps[0] + .get("sources") + .and_then(|s| s.get("log")) + .and_then(|v| v.as_str()) + .unwrap_or(""); + assert!( + log_path.contains("strong"), + "expected the highest-scoring log to be the representative; got {}", + log_path + ); + } + + /// A bill posted once writes a **bill-level** GUID to the ledger. + /// When the next run discovers a *new* action log for the same bill, + /// the bill must NOT re-post — the ledger key is the bill, not the + /// log, so future logs are deduplicated to the same key and the + /// publisher recognises the bill as already-posted. + #[test] + fn bluesky_ledger_uses_bill_level_guid_to_prevent_repost_when_new_logs_appear() { + // Round 1: post the bill once via its first action log. + let bill_path = + "nv-legislation/country:us/state:xx/sessions/2025Special36/bills/AB1".to_string(); + + let round1 = vec![log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.first.json", + 0.92, + )]; + let reps1 = pick_per_bill_representatives(&round1, &[], 0.5); + assert_eq!(reps1.len(), 1); + let post1 = render_post(reps1[0], None, None, None); + assert_eq!( + post1.id, bill_path, + "ledger id must be the bill-level guid (no /logs/...) 
— got {}", + post1.id + ); + + // Simulate writing the ledger. + let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + let ledger = resolve_ledger_path(&job); + append_ledger(&ledger, &post1.id).unwrap(); + + // Round 2: a **new** action log for the same bill arrives. + let round2 = vec![ + // The old log is still in the stream (the source walks every + // log file on disk every run). + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.first.json", + 0.92, + ), + // Plus a freshly-arrived second log. + log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251112T080000Z.second.json", + 0.93, + ), + ]; + let reps2 = pick_per_bill_representatives(&round2, &[], 0.5); + let post2 = render_post(reps2[0], None, None, None); + assert_eq!( + post2.id, bill_path, + "representative's ledger id must still be the bill-level guid" + ); + + // The ledger already contains this bill — `run_bluesky` would + // filter it out as already-posted. Confirm at the unit-level: + let already: HashSet<String> = read_ledger(&ledger) + .unwrap() + .into_iter() + .map(|s| ledger_id_to_bill_key(&s)) + .collect(); + assert!( + already.contains(&post2.id), + "ledger should recognise the bill as already-posted; ledger={:?}, post.id={}", + already, + post2.id + ); + + // And confirm we wouldn't append a duplicate. + let before = std::fs::read_to_string(&ledger).unwrap(); + let lines_before = before.lines().count(); + assert_eq!( + lines_before, 1, + "ledger should hold exactly one entry for the bill" + ); + } + + /// A pre-fix ledger holding **per-log** GUIDs is still read on + /// upgrade — the publisher doesn't crash, and per-bill-log-layout + /// entries cleanly suppress re-posts. The session-level-log-layout + /// case incurs the documented one-time re-post (see + /// `ledger_id_to_bill_key`).
+ #[test] + fn bluesky_ledger_respects_legacy_per_log_guids() { + // Per-bill-log layout: legacy GUID already carries `/bills/` + // before the `/logs/` segment, so stripping `/logs/...` yields + // the new bill-level key directly. The bill is recognised as + // already-posted and re-posts are suppressed. + let legacy_per_bill_log = + "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/20250101T000000Z.passage.json"; + let expected_bill_key = "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001"; + assert_eq!( + ledger_id_to_bill_key(legacy_per_bill_log), + expected_bill_key, + "per-bill-log legacy entries strip to the bill-level guid cleanly" + ); + + // Session-level-log layout (OCD-files common case): legacy GUID + // ends in `/logs/`; stripping yields the session + // prefix, which does NOT match the new bill-level key. We + // document the resulting behavior — the bill re-posts once, + // then the new bill-level GUID lands in the ledger and never + // re-posts again. + let legacy_session_log = + "nv-legislation/country:us/state:nv/sessions/2025Special36/logs/20251111T080000Z.first.json"; + assert_eq!( + ledger_id_to_bill_key(legacy_session_log), + "nv-legislation/country:us/state:nv/sessions/2025Special36", + "session-level legacy entries strip to the session prefix — the bill \ + segment isn't in the legacy path so it can't be recovered. \ + Bills under this layout re-post once on the first post-upgrade run." + ); + + // End-to-end: seed a legacy ledger with the per-bill-log entry, + // confirm the publisher reads it and recognises the bill as + // already-posted (matching the new bill-level GUID a post would + // write). 
+ let dir = tempdir().unwrap(); + let p = bluesky_publisher_default(); + let job = job_for_publisher("bluesky", &p, dir.path().to_path_buf()); + + let legacy = legacy_ledger_path(&job); + std::fs::create_dir_all(legacy.parent().unwrap()).unwrap(); + std::fs::write(&legacy, format!("{}\n", legacy_per_bill_log)).unwrap(); + + // Build a new HB0001 entry whose `sources.log` happens to be the + // per-bill-log path (matches what a post would render). + let entry = json!({ + "id": "HB0001", + "bill": { "title": "WY HB0001", "identifier": "HB0001" }, + "log": { "bill_id": "HB0001" }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/20250102T000000Z.next.json" + }, + "tags": { "clean_energy": { "final_score": 0.92 } } + }); + let post = render_post(&entry, None, None, None); + assert_eq!( + post.id, expected_bill_key, + "new post's id is the bill-level guid" + ); + + // The legacy ledger entry collapses to the same bill-level key, + // so the publisher's already-posted set contains the bill. + let already: HashSet<String> = read_ledger(&legacy) + .unwrap() + .into_iter() + .map(|s| ledger_id_to_bill_key(&s)) + .collect(); + assert!( + already.contains(&post.id), + "the legacy per-bill-log GUID must collapse to the same bill-level \ + key the new post writes; ledger={:?}, post.id={}", + already, + post.id + ); + } + + /// `bill_guid` is the canonical bill key; it strips `/logs/...` from + /// `sources.log` and appends `/bills/` (the OCD-files common + /// case). Sanity-check the shape and the dedup the publisher relies on.
+ #[test] + fn bill_guid_collapses_session_level_logs_to_one_bill_key() { + let a = log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.a.json", + 0.9, + ); + let b = log_entry( + "nv-legislation", + "2025Special36", + "AB1", + "20251112T080000Z.b.json", + 0.9, + ); + let c = log_entry( + "nv-legislation", + "2025Special36", + "AB2", // different bill + "20251111T080000Z.c.json", + 0.9, + ); + assert_eq!(crate::rss::bill_guid(&a), crate::rss::bill_guid(&b)); + assert_ne!(crate::rss::bill_guid(&a), crate::rss::bill_guid(&c)); + } +} diff --git a/actions/govbot/src/cache.rs b/actions/govbot/src/cache.rs new file mode 100644 index 00000000..388822f7 --- /dev/null +++ b/actions/govbot/src/cache.rs @@ -0,0 +1,162 @@ +//! The shared, content-addressed dataset cache at `~/.govbot/cache/`. +//! +//! ## The problem this solves +//! +//! Before this, every govbot project cloned every dataset into its own +//! `/.govbot/repos/`. Ten climate projects on one laptop meant ten +//! clones of `wy-legislation`. The cache makes a dataset **cloned once per +//! machine**: the heavy git repo lives in `~/.govbot/cache/`, and a project's +//! `.govbot/repos/` is a lightweight reference into it. +//! +//! ## Layout +//! +//! ```text +//! ~/.govbot/ +//! cache/ +//! / a bare-ish working clone of one dataset@channel +//! registry.json most-recently fetched registry (see registry.rs) +//! ``` +//! +//! The cache **key** is content-addressed over the dataset's *identity* — its +//! git URL plus channel — as a short SHA-256 hex digest, prefixed with the +//! dataset's short name for human readability: +//! +//! ```text +//! wy-legislation-3f9a1c20e5b4 +//! us-counties__cook-7a2b... (a '/' in a namespace becomes '__') +//! ``` +//! +//! Keying on URL+channel (not on a resolved SHA) keeps the clone path stable +//! across `pull`s: the same dataset always maps to the same cache directory, +//! `git pull` updates it in place, and `govbot.lock` records the exact SHA. 
+//! A *second* `pull` in any project finds the cache populated and only fetches +//! deltas — no re-clone. +//! +//! ## How a project references the cache +//! +//! A project's `.govbot/repos/` is a symlink to the cache entry +//! (a plain directory copy is the fallback where symlinks are unavailable). +//! Downstream code (`source`, `load`) walks `.govbot/repos/` exactly as +//! before — it does not need to know the cache exists. + +use crate::error::{Error, Result}; +use sha2::{Digest, Sha256}; +use std::path::PathBuf; + +/// The govbot home directory: `~/.govbot`. Honors `GOVBOT_HOME` for tests. +pub fn govbot_home() -> Option<PathBuf> { + if let Some(explicit) = std::env::var_os("GOVBOT_HOME") { + let p = PathBuf::from(explicit); + if !p.as_os_str().is_empty() { + return Some(p); + } + } + std::env::var_os("HOME") + .or_else(|| std::env::var_os("USERPROFILE")) + .map(PathBuf::from) + .filter(|p| !p.as_os_str().is_empty()) + .map(|h| h.join(".govbot")) +} + +/// The shared content-addressed cache directory: `~/.govbot/cache`. +pub fn cache_dir() -> Result<PathBuf> { + let home = govbot_home() + .ok_or_else(|| Error::Config("Could not determine home directory for cache".into()))?; + Ok(home.join("cache")) +} + +/// Compute the content-addressed cache key for a dataset's identity. +/// +/// The key is the sanitized short name and a digest joined by `-`, where the +/// digest is the first 12 hex chars (6 bytes) of +/// `sha256(git_url + "@" + channel)`; a missing channel hashes as the empty +/// string. A `/` in the short name (it should not contain one, but be +/// defensive) becomes `__`. +pub fn cache_key(short_name: &str, git_url: &str, channel: Option<&str>) -> String { + let mut hasher = Sha256::new(); + hasher.update(git_url.as_bytes()); + hasher.update(b"@"); + hasher.update(channel.unwrap_or("").as_bytes()); + let digest = hasher.finalize(); + let hex: String = digest + .iter() + .take(6) + .map(|b| format!("{:02x}", b)) + .collect(); + let safe_name = short_name.replace('/', "__"); + format!("{}-{}", safe_name, hex) +} + +/// The absolute path of a dataset's entry in the shared cache. 
+pub fn cache_path(short_name: &str, git_url: &str, channel: Option<&str>) -> Result<PathBuf> { + Ok(cache_dir()?.join(cache_key(short_name, git_url, channel))) +} + +/// Link a project's `repos/` directory to a populated cache entry. +/// +/// Creates a symlink (cheap, shared): a Unix symlink on Unix targets, a +/// directory symlink elsewhere. No copy fallback is implemented here; if the +/// link cannot be created, an error is returned. Idempotent — an existing +/// correct link is a no-op; a stale link is replaced. +pub fn link_into_project( + cache_entry: &std::path::Path, + project_repo: &std::path::Path, +) -> Result<()> { + if let Some(parent) = project_repo.parent() { + std::fs::create_dir_all(parent)?; + } + + // If the project repo path is already a symlink to the right place, done. + if let Ok(existing) = std::fs::read_link(project_repo) { + if existing == cache_entry { + return Ok(()); + } + // Stale symlink — remove it. + let _ = std::fs::remove_file(project_repo); + } else if project_repo.exists() { + // A real directory is sitting where the link should be (a pre-cache + // clone). Remove it so the cache becomes the single source of truth. 
+ let _ = std::fs::remove_dir_all(project_repo); + } + + #[cfg(unix)] + { + std::os::unix::fs::symlink(cache_entry, project_repo).map_err(|e| { + Error::Config(format!( + "Failed to link cache entry {} into project: {}", + cache_entry.display(), + e + )) + })?; + Ok(()) + } + + #[cfg(not(unix))] + { + std::os::windows::fs::symlink_dir(cache_entry, project_repo).map_err(|e| { + Error::Config(format!( + "Failed to link cache entry {} into project: {}", + cache_entry.display(), + e + )) + })?; + Ok(()) + } +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn cache_key_is_stable_and_named() { + let k1 = cache_key("wy-legislation", "https://example.com/wy.git", None); + let k2 = cache_key("wy-legislation", "https://example.com/wy.git", None); + assert_eq!(k1, k2, "cache key must be deterministic"); + assert!(k1.starts_with("wy-legislation-")); + } + + #[test] + fn cache_key_differs_by_url_and_channel() { + let base = cache_key("wy", "https://a/wy.git", None); + assert_ne!(base, cache_key("wy", "https://b/wy.git", None)); + assert_ne!(base, cache_key("wy", "https://a/wy.git", Some("nightly"))); + } +} diff --git a/actions/govbot/src/config.rs b/actions/govbot/src/config.rs index dfbf0d4e..6bf1e808 100644 --- a/actions/govbot/src/config.rs +++ b/actions/govbot/src/config.rs @@ -1,5 +1,188 @@ use crate::error::{Error, Result}; -use std::path::PathBuf; +use serde::Deserialize; +use std::collections::BTreeMap; +use std::path::{Path, PathBuf}; + +// ============================================================ +// govbot.yml — the project manifest (datasets / transforms / +// publish / pipelines). This is the typed view of the schema in +// `schemas/govbot.schema.json`. It is the layer-2 contract config +// and is distinct from the pipeline-processor `Config` below +// (whose `repos` is CLI-arg state, not manifest state). +// ============================================================ + +/// A `govbot.yml` manifest. 
`additionalProperties: false` in the schema — +/// an unknown top-level key (notably the retired `tags:`) fails to parse. +#[derive(Debug, Clone, Deserialize)] +#[serde(deny_unknown_fields)] +pub struct Manifest { + /// Optional `$schema` reference for editor autocomplete; ignored at runtime. + #[serde(default, rename = "$schema")] + pub schema: Option, + + /// Government-data sources the project pulls and processes. + pub datasets: Vec, + + /// Named external-process transforms, keyed by name. + #[serde(default)] + pub transforms: BTreeMap, + + /// Named publishers, keyed by name. + #[serde(default)] + pub publish: BTreeMap, + + /// Named `govbot run` targets — ordered lists of transform/publisher names. + #[serde(default)] + pub pipelines: BTreeMap>, +} + +/// A single external-process transform stage. +#[derive(Debug, Clone, Deserialize)] +pub struct Transform { + /// The external process to run. Either a shell-style string or an argv array. + pub command: Command_, + + /// The stream record kind this transform consumes (e.g. `docs`). + pub reads: String, + + /// The stream record kind this transform produces (e.g. `classification`). + pub writes: String, + + /// For a classify-style transform: the path to the fastclass classifier + /// bundle directory. govbot passes this path through unchanged. + #[serde(default)] + pub classifier: Option, +} + +/// A transform `command`: either a single shell-style string or an argv array. +#[derive(Debug, Clone, Deserialize)] +#[serde(untagged)] +pub enum Command_ { + /// A single string, split on whitespace into argv. + Shell(String), + /// An explicit argv array (first element is the executable). + Argv(Vec), +} + +impl Command_ { + /// Resolve to an argv vector. A `Shell` string is whitespace-split. + pub fn argv(&self) -> Vec { + match self { + Command_::Shell(s) => s.split_whitespace().map(|s| s.to_string()).collect(), + Command_::Argv(v) => v.clone(), + } + } +} + +/// The publisher kind. 
Mirrors `govbot.schema.json`'s `publisher.type` enum. +#[derive(Debug, Clone, Copy, PartialEq, Eq, Deserialize)] +#[serde(rename_all = "lowercase")] +pub enum PublisherKind { + Rss, + Html, + Json, + Duckdb, + /// Bluesky publisher — the extension point for Wave 3 (not yet implemented). + Bluesky, +} + +/// A single publisher stage. Required fields depend on `type`. +#[derive(Debug, Clone, Deserialize)] +pub struct Publisher { + /// The publisher kind (`rss` / `html` / `json` / `duckdb` / `bluesky`). + #[serde(rename = "type")] + pub kind: PublisherKind, + + /// Tag names to include. Only records carrying one of these tags are + /// published; if omitted, all tagged records are published. + #[serde(default)] + pub select: Option>, + + /// Base URL for generated links (required for `rss`/`html`). + #[serde(default)] + pub base_url: Option, + + /// Directory the publisher writes artifacts into (used by rss/html/json). + #[serde(default)] + pub output_dir: Option, + + /// Output filename for the primary artifact. + #[serde(default)] + pub output_file: Option, + + /// Custom feed/index title. + #[serde(default)] + pub title: Option, + + /// Custom feed/index description. + #[serde(default)] + pub description: Option, + + /// Maximum number of entries. The string `"none"` means no limit. + #[serde(default)] + pub limit: Option, + + // ---- bluesky-publisher fields ---------------------------------------- + // These configure the `bluesky` publisher only; other kinds ignore them. + // Credentials are NOT here — they are read from the environment + // (`BLUESKY_HANDLE` / `BLUESKY_APP_PASSWORD` / `BLUESKY_SERVICE`). + /// Minimum calibrated `final_score` a matched tag must reach for a record + /// to be posted. `final_score` is the contractually calibrated probability + /// from the fastclass result (STREAM_PROTOCOL §5). 
+ #[serde(default)] + pub min_score: Option<f64>, + + /// Path to the append-only posted-state ledger that makes the publisher + /// idempotent — re-runs never double-post. Relative to the project + /// directory; defaults to `state/bluesky-.ledger` (peer to + /// `tags/` and `dist/`; NOT under `.govbot/`, which is the tool's + /// regenerable cache). On upgrade, a legacy + /// `.govbot/bluesky-.ledger` is read as a fallback so post + /// history survives; writes always land at the new path. + #[serde(default)] + pub ledger: Option, + + /// Post-text template. `{placeholders}` are substituted per record: + /// `{title}`, `{tags}`, `{link}`, `{identifier}`, `{session}`, `{score}`. + /// If omitted, a sensible default template is used. + #[serde(default)] + pub post_template: Option, +} + +impl Publisher { + /// Resolve the calibrated-score threshold for the `bluesky` publisher. + /// Falls back to a conservative default (0.6) so a misconfigured manifest + /// does not flood a feed with low-confidence matches. + pub fn resolved_min_score(&self) -> f64 { + self.min_score.unwrap_or(0.6) + } + + /// Resolve `limit` to an `Option<usize>` where `None` means unlimited. An + /// omitted `limit` yields `default`; the string `"none"` (any case) yields + /// `None`; an integer or numeric string is the cap; any other value falls + /// back to `default`. + pub fn resolved_limit(&self, default: Option<usize>) -> Option<usize> { + match &self.limit { + None => default, + Some(serde_yaml::Value::String(s)) if s.eq_ignore_ascii_case("none") => None, + Some(serde_yaml::Value::String(s)) => s.parse().ok().or(default), + Some(serde_yaml::Value::Number(n)) => n.as_u64().map(|n| n as usize).or(default), + Some(_) => default, + } + } +} + +impl Manifest { + /// Load and parse a `govbot.yml` manifest. A manifest carrying the retired + /// `tags:` block (or any other unknown key) fails here via + /// `deny_unknown_fields`. 
+ pub fn load(path: &Path) -> anyhow::Result { + use anyhow::Context; + let contents = std::fs::read_to_string(path) + .with_context(|| format!("Failed to read manifest: {}", path.display()))?; + let manifest: Manifest = serde_yaml::from_str(&contents) + .with_context(|| format!("Failed to parse govbot.yml manifest: {}", path.display()))?; + Ok(manifest) + } +} /// Sort order for log entries #[derive(Debug, Clone, Copy, PartialEq, Eq)] @@ -12,7 +195,7 @@ impl From<&str> for SortOrder { fn from(s: &str) -> Self { match s.to_uppercase().as_str() { "ASC" => SortOrder::Ascending, - "DESC" | _ => SortOrder::Descending, + _ => SortOrder::Descending, } } } diff --git a/actions/govbot/src/embeddings.rs b/actions/govbot/src/embeddings.rs deleted file mode 100644 index feb68ab2..00000000 --- a/actions/govbot/src/embeddings.rs +++ /dev/null @@ -1,496 +0,0 @@ -use ort::inputs; -use ort::session::Session; -use ort::value::Value; -use std::collections::HashMap; -use std::path::Path; -use tokenizers::Tokenizer; - -use ndarray::Array1; -use regex::Regex; -use serde::{Deserialize, Serialize}; -use sha2::{Digest, Sha256}; - -use crate::selectors::ocd_files_select_default; - -/// Breakdown of scoring components for a tag match -#[derive(Debug, Clone, Serialize, Deserialize)] -pub struct ScoreBreakdown { - pub final_score: f64, - pub base_embedding: Option, - pub example_similarity: Option, - /// Keywords from include_keywords that matched in the text - #[serde(default)] - pub keyword_match: Vec, - pub negative_penalty: f64, -} - -/// Tag file structure with metadata, text cache, and bill results -#[derive(Debug, Clone, Serialize, Deserialize)] -pub struct TagFile { - pub metadata: TagFileMetadata, - pub tag_config: TagDefinition, - #[serde(default)] - pub text_cache: HashMap, - pub bills: HashMap, -} - -/// Metadata about the tag file -#[derive(Debug, Clone, Serialize, Deserialize)] -pub struct TagFileMetadata { - pub last_run: String, - pub model: String, - pub tag_config_hash: 
String, -} - -/// Result for a single bill -#[derive(Debug, Clone, Serialize, Deserialize)] -pub struct BillTagResult { - pub text_hash: String, - pub score: ScoreBreakdown, -} - -/// Hash text for deduplication -pub fn hash_text(text: &str) -> String { - let mut hasher = Sha256::new(); - hasher.update(text.as_bytes()); - format!("{:x}", hasher.finalize()) -} - -/// Tag definition provided by the creator -#[derive(Debug, Deserialize, Serialize, Clone)] -pub struct TagDefinition { - pub name: String, - #[serde(default)] - pub description: String, - #[serde(default)] - pub examples: Vec, - #[serde(default)] - pub include_keywords: Vec, - #[serde(default)] - pub exclude_keywords: Vec, - #[serde(default)] - pub negative_examples: Vec, - /// Minimum similarity score (0.0 - 1.0). Default to 0.5 if not provided. - #[serde(default = "default_threshold")] - pub threshold: f32, -} - -fn default_threshold() -> f32 { - 0.5 -} - -#[derive(Debug, Deserialize)] -pub struct RawTag { - #[serde(default)] - pub description: String, - #[serde(default)] - pub examples: Vec, - #[serde(default)] - pub include_keywords: Vec, - #[serde(default)] - pub exclude_keywords: Vec, - #[serde(default)] - pub negative_examples: Vec, - #[serde(default = "default_threshold")] - pub threshold: f32, -} - -#[derive(Debug, Deserialize)] -pub struct RawTagConfig { - pub tags: std::collections::HashMap, -} - -pub fn load_tags_config>(path: P) -> anyhow::Result> { - let contents = std::fs::read_to_string(path)?; - let raw: RawTagConfig = serde_yaml::from_str(&contents) - .map_err(|e| anyhow::anyhow!("Failed to parse govbot.yml: {}", e))?; - - let mut tags = Vec::new(); - for (name, raw_tag) in raw.tags { - tags.push(TagDefinition { - name, - description: raw_tag.description, - examples: raw_tag.examples, - include_keywords: raw_tag.include_keywords, - exclude_keywords: raw_tag.exclude_keywords, - negative_examples: raw_tag.negative_examples, - threshold: raw_tag.threshold, - }); - } - Ok(tags) -} - -/// 
Lightweight embedding service powered by ONNX Runtime -pub struct EmbeddingService { - session: Session, - tokenizer: Tokenizer, -} - -impl EmbeddingService { - pub fn new>(model_path: P, tokenizer_path: P) -> anyhow::Result { - let tokenizer = Tokenizer::from_file(tokenizer_path.as_ref()) - .map_err(|e| anyhow::anyhow!("Failed to load tokenizer: {}", e))?; - - let session = Session::builder()?.commit_from_file(model_path)?; - - Ok(Self { session, tokenizer }) - } - - /// Embed text using the configured model with mean pooling over last hidden state - pub fn embed(&mut self, text: &str) -> anyhow::Result> { - // Tokenize - let encoding = self - .tokenizer - .encode(text, true) - .map_err(|e| anyhow::anyhow!("Tokenizer encode failed: {}", e))?; - - let ids = encoding.get_ids(); - let mask = encoding.get_attention_mask(); - let type_ids = encoding.get_type_ids(); - - let input_ids: Vec = ids.iter().map(|&x| x as i64).collect(); - let attention_mask_vec: Vec = mask.iter().map(|&x| x as i64).collect(); - let token_type_vec: Vec = type_ids.iter().map(|&x| x as i64).collect(); - - let outputs = self.session.run(inputs![ - "input_ids" => Value::from_array((vec![1_i64, ids.len() as i64], input_ids))?, - "attention_mask" => Value::from_array((vec![1_i64, mask.len() as i64], attention_mask_vec))?, - "token_type_ids" => Value::from_array((vec![1_i64, type_ids.len() as i64], token_type_vec))?, - ])?; - - // Use last_hidden_state and mean-pool - let hidden = outputs["last_hidden_state"].try_extract_array::()?; - - // hidden shape: [batch, seq_len, hidden_dim] - let shape = hidden.shape(); - if shape.len() != 3 { - return Err(anyhow::anyhow!("Unexpected embedding shape {:?}", shape)); - } - let seq_len = shape[1]; - let hidden_dim = shape[2]; - - let mut pooled = vec![0f32; hidden_dim]; - for i in 0..seq_len { - for h in 0..hidden_dim { - pooled[h] += hidden[[0, i, h]]; - } - } - for h in 0..hidden_dim { - pooled[h] /= seq_len as f32; - } - let pooled = Array1::from(pooled); - - 
Ok(pooled) - } - - pub fn cosine_similarity(&self, a: &Array1, b: &Array1) -> f32 { - let dot = a.dot(b); - let norm_a = a.dot(a).sqrt(); - let norm_b = b.dot(b).sqrt(); - dot / (norm_a * norm_b).max(1e-9) - } -} - -/// Return all keywords from the list that appear in the text -/// (case-insensitive, word-boundary aware). -fn find_matching_keywords(text: &str, keywords: &[String]) -> Vec { - let text_lower = text.to_lowercase(); - let mut matches = Vec::new(); - - for keyword in keywords { - let keyword_lower = keyword.to_lowercase(); - // Check for exact word match or phrase match - // For multi-word keywords, use contains - // For single-word keywords, check word boundaries - let is_match = if keyword_lower.contains(' ') { - // Multi-word phrase: use contains - text_lower.contains(&keyword_lower) - } else { - // Single word: check word boundaries to avoid partial matches - // e.g., "trans" should not match "transport" or "transfer" - // But "lgbtq" should match "lgbtq+" (with punctuation) - let escaped = regex::escape(&keyword_lower); - let pattern = format!(r"\b{}(?:\+|\b)", escaped); - Regex::new(&pattern) - .map(|re| re.is_match(&text_lower)) - .unwrap_or_else(|_| text_lower.contains(&keyword_lower)) - }; - - if is_match { - matches.push(keyword.clone()); - } - } - - matches -} - -/// Matcher that precomputes tag embeddings and scores logs against them -pub struct TagMatcher { - embeddings: std::sync::Mutex, - tag_embeddings: HashMap>, - example_embeddings: HashMap>>, - negative_example_embeddings: HashMap>>, - tags: HashMap, -} - -impl TagMatcher { - pub fn from_files>( - model_path: P, - tokenizer_path: P, - tags_path: P, - ) -> anyhow::Result { - let mut embeddings = EmbeddingService::new(&model_path, &tokenizer_path)?; - - // Load tags YAML - let tag_defs = load_tags_config(tags_path)?; - - // Precompute tag embeddings - let mut tag_embeddings = HashMap::new(); - let mut example_embeddings = HashMap::new(); - let mut negative_example_embeddings = 
HashMap::new(); - let mut tags_map = HashMap::new(); - - for tag in tag_defs { - // Combine description + examples for richer embedding - let mut text = tag.description.clone(); - if !tag.examples.is_empty() { - text.push_str(" Examples: "); - text.push_str(&tag.examples.join(" | ")); - } - let emb = embeddings.embed(&text)?; - tag_embeddings.insert(tag.name.clone(), emb); - - // Precompute embeddings for individual examples - let mut example_embs = Vec::new(); - for example in &tag.examples { - let example_emb = embeddings.embed(example)?; - example_embs.push(example_emb); - } - example_embeddings.insert(tag.name.clone(), example_embs); - - // Precompute embeddings for negative examples - let mut neg_example_embs = Vec::new(); - for neg_example in &tag.negative_examples { - let neg_emb = embeddings.embed(neg_example)?; - neg_example_embs.push(neg_emb); - } - negative_example_embeddings.insert(tag.name.clone(), neg_example_embs); - - tags_map.insert(tag.name.clone(), tag); - } - - Ok(Self { - embeddings: std::sync::Mutex::new(embeddings), - tag_embeddings, - example_embeddings, - negative_example_embeddings, - tags: tags_map, - }) - } - - /// Calculate composite score using multiple signals - fn calculate_composite_score( - &self, - log_embedding: &Array1, - log_text: &str, - tag_name: &str, - tag_def: &TagDefinition, - embeddings: &mut EmbeddingService, - ) -> ScoreBreakdown { - // 4. Exclude keywords: zero out if exclude keywords match (check first). - // We don't currently expose which exclude keyword matched; we just block the tag. - if !tag_def.exclude_keywords.is_empty() { - let exclude_matches = find_matching_keywords(log_text, &tag_def.exclude_keywords); - if !exclude_matches.is_empty() { - return ScoreBreakdown { - final_score: 0.0, - base_embedding: None, - example_similarity: None, - keyword_match: Vec::new(), - negative_penalty: 0.0, - }; - } - } - - // 3. 
Include keywords: if keywords match, they have the heaviest impact - let include_matches = if tag_def.include_keywords.is_empty() { - Vec::new() - } else { - find_matching_keywords(log_text, &tag_def.include_keywords) - }; - let has_keyword_match = !include_matches.is_empty(); - - let mut score = 0.0; - let mut weight_sum = 0.0; - let mut base_embedding_score: Option = None; - let mut example_similarity_score: Option = None; - - // 1. Base score: embedding similarity to description + examples - // Industry standard: embeddings are the primary signal - if let Some(tag_emb) = self.tag_embeddings.get(tag_name) { - let base_score = embeddings.cosine_similarity(log_embedding, tag_emb); - base_embedding_score = Some(base_score); - // Weight embeddings less when keywords match (keywords will add boost) - let weight = if has_keyword_match { 0.35 } else { 0.5 }; - score += base_score * weight; - weight_sum += weight; - } - - // 2. Example similarity: max similarity to individual examples - if let Some(example_embs) = self.example_embeddings.get(tag_name) { - if !example_embs.is_empty() { - let max_example_score = example_embs - .iter() - .map(|example_emb| embeddings.cosine_similarity(log_embedding, example_emb)) - .fold(0.0f32, f32::max); - example_similarity_score = Some(max_example_score); - let weight = if has_keyword_match { 0.25 } else { 0.35 }; - score += max_example_score * weight; - weight_sum += weight; - } - } - - // 3. 
Keyword boost: additive boost when keywords match - // Keywords are explicit signals and should have strong weight - // This ensures keyword matches are strong but still respect embedding quality - if has_keyword_match { - // Strong boost for keywords - they are explicit signals from the tag definition - // Higher than typical industry systems because keywords are curated and highly reliable - let keyword_boost = 0.4; - score += keyword_boost; - weight_sum += keyword_boost; - } - - // Normalize the weighted combination - if weight_sum > 0.0 { - score = score / weight_sum; - } - - // If keywords matched, ensure minimum score meets threshold (before negative penalty) - // Keywords are explicit signals, so they should guarantee threshold unless negated - if has_keyword_match { - score = score.max(tag_def.threshold); - } - - // 5. Negative examples: penalty if too similar to negative examples - let mut negative_penalty = 0.0f32; - if let Some(neg_example_embs) = self.negative_example_embeddings.get(tag_name) { - if !neg_example_embs.is_empty() { - let max_neg_score = neg_example_embs - .iter() - .map(|neg_emb| embeddings.cosine_similarity(log_embedding, neg_emb)) - .fold(0.0f32, f32::max); - // Apply penalty: subtract up to 0.25 based on negative similarity - // Higher negative similarity = stronger penalty - negative_penalty = max_neg_score * 0.25; - score = (score - negative_penalty).max(0.0); - } - } - - // Clamp to [0, 1] - let final_score = score.min(1.0).max(0.0); - - ScoreBreakdown { - final_score: final_score as f64, - base_embedding: base_embedding_score.map(|s| s as f64), - example_similarity: example_similarity_score.map(|s| s as f64), - keyword_match: include_matches, - negative_penalty: negative_penalty as f64, - } - } - - /// Match a serde_json::Value log entry against tags, returning (tag, score_breakdown) - pub fn match_json_value( - &self, - value: &serde_json::Value, - ) -> anyhow::Result> { - let text = ocd_files_select_default(value); - let mut 
embeddings = self.embeddings.lock().unwrap(); - let log_embedding = embeddings.embed(&text)?; - - let mut results = Vec::new(); - for (name, tag_def) in &self.tags { - let score_breakdown = self.calculate_composite_score( - &log_embedding, - &text, - name, - tag_def, - &mut *embeddings, - ); - if score_breakdown.final_score >= tag_def.threshold as f64 { - results.push((name.clone(), score_breakdown)); - } - } - - // Sort descending by final score - results.sort_by(|a, b| { - b.1.final_score - .partial_cmp(&a.1.final_score) - .unwrap_or(std::cmp::Ordering::Equal) - }); - Ok(results) - } - - /// Access tag definitions (name -> definition) - pub fn tag_definitions(&self) -> &HashMap { - &self.tags - } -} - -/// Keyword-based fallback matcher when embedding mode is unavailable -/// Matches tags based on include_keywords and exclude_keywords from tag definitions -pub fn match_tags_keywords( - tag_defs: &[TagDefinition], - json_entry: &serde_json::Value, -) -> Vec<(String, ScoreBreakdown)> { - let text = ocd_files_select_default(json_entry); - let text_lower = text.to_lowercase(); - - let mut results = Vec::new(); - - for tag_def in tag_defs { - // Check exclude_keywords first - if any match, skip this tag - if !tag_def.exclude_keywords.is_empty() { - let exclude_matches = find_matching_keywords(&text_lower, &tag_def.exclude_keywords); - if !exclude_matches.is_empty() { - continue; - } - } - - // Check include_keywords - if any match, create a match - let include_matches = if tag_def.include_keywords.is_empty() { - Vec::new() - } else { - find_matching_keywords(&text_lower, &tag_def.include_keywords) - }; - - if !include_matches.is_empty() { - // If keywords match, assign a score based on threshold - // Use threshold as the base score, or 0.6 if threshold is lower - let score = tag_def.threshold.max(0.6) as f64; - - // Only include if score meets threshold - if score >= tag_def.threshold as f64 { - results.push(( - tag_def.name.clone(), - ScoreBreakdown { - final_score: 
score, - base_embedding: None, - example_similarity: None, - keyword_match: include_matches, - negative_penalty: 0.0, - }, - )); - } - } - } - - // Sort by score descending - results.sort_by(|a, b| { - b.1.final_score - .partial_cmp(&a.1.final_score) - .unwrap_or(std::cmp::Ordering::Equal) - }); - - results -} diff --git a/actions/govbot/src/filters/ak-legislation/default.rs b/actions/govbot/src/filters/ak-legislation/default.rs index 882b5efe..8ae70400 100644 --- a/actions/govbot/src/filters/ak-legislation/default.rs +++ b/actions/govbot/src/filters/ak-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ak-legislation (Alaska): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ak --limit=100` +// `just govbot source --repos=ak --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ak --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=ak --limit=100 --filter=default` // // Current filter removes: routine committee abbreviations, minutes, hearings, referrals, and filing actions // ====================================== diff --git a/actions/govbot/src/filters/al-legislation/default.rs b/actions/govbot/src/filters/al-legislation/default.rs index 4884e3db..6432c71c 100644 --- a/actions/govbot/src/filters/al-legislation/default.rs +++ b/actions/govbot/src/filters/al-legislation/default.rs @@ -4,7 +4,7 @@ // to make the output more focused on important legislative actions. // // TO UPDATE THIS FILTER: -// 1. Run: `just govbot logs --repos=al --limit=100` to see recent log entries +// 1. Run: `just govbot source --repos=al --limit=100` to see recent log entries // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=al --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=al --limit=100 --filter=default` // // Current filter removes: routine filing, first reading/referral, and pending committee status updates // ====================================== diff --git a/actions/govbot/src/filters/ar-legislation/default.rs b/actions/govbot/src/filters/ar-legislation/default.rs index 99fcc315..fc77eef5 100644 --- a/actions/govbot/src/filters/ar-legislation/default.rs +++ b/actions/govbot/src/filters/ar-legislation/default.rs @@ -4,7 +4,7 @@ // to make the output more focused on important legislative actions. // // TO UPDATE THIS FILTER: -// 1. Run: `just govbot logs --repos=ar --limit=100` to see recent log entries +// 1. Run: `just govbot source --repos=ar --limit=100` to see recent log entries // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ar --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=ar --limit=100 --filter=default` // // Current filter removes: routine filing, first reading/referrals, and routine procedural actions // ====================================== diff --git a/actions/govbot/src/filters/az-legislation/default.rs b/actions/govbot/src/filters/az-legislation/default.rs index 6837a4bd..db902452 100644 --- a/actions/govbot/src/filters/az-legislation/default.rs +++ b/actions/govbot/src/filters/az-legislation/default.rs @@ -4,7 +4,7 @@ // to make the output more focused on important legislative actions. // // TO UPDATE THIS FILTER: -// 1. Run: `just govbot logs --repos=az --limit=100` to see recent log entries +// 1. Run: `just govbot source --repos=az --limit=100` to see recent log entries // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,14 +15,14 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=az --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=az --limit=100 --filter=default` // // Current filter: No test data available - placeholder filter // ====================================== // Filter for az-legislation (Arizona) // Note: Repository not found in test data - keeping default filter for now -// TODO: Analyze output from `just govbot logs --repos=az --limit=100` when data is available +// TODO: Analyze output from `just govbot source --repos=az --limit=100` when data is available use crate::filter::FilterResult; use serde_json::Value; diff --git a/actions/govbot/src/filters/ca-legislation/default.rs b/actions/govbot/src/filters/ca-legislation/default.rs index 335e7ff3..0a775e97 100644 --- a/actions/govbot/src/filters/ca-legislation/default.rs +++ b/actions/govbot/src/filters/ca-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ca-legislation (California): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ca --limit=100` +// `just govbot source --repos=ca --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ca --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=ca --limit=100 --filter=default` // // Current filter removes: routine committee referrals, introductions, and routine reading actions // ====================================== diff --git a/actions/govbot/src/filters/co-legislation/default.rs b/actions/govbot/src/filters/co-legislation/default.rs index a3ad5579..449842d4 100644 --- a/actions/govbot/src/filters/co-legislation/default.rs +++ b/actions/govbot/src/filters/co-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR co-legislation (Colorado): // 1. First, gather real data by running this command: -// `just govbot logs --repos=co --limit=100` +// `just govbot source --repos=co --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=co --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=co --limit=100 --filter=default` // // Current filter removes: routine introductions and committee referrals // ====================================== diff --git a/actions/govbot/src/filters/ct-legislation/default.rs b/actions/govbot/src/filters/ct-legislation/default.rs index d4fe808f..b06ce867 100644 --- a/actions/govbot/src/filters/ct-legislation/default.rs +++ b/actions/govbot/src/filters/ct-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ct-legislation (Connecticut): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=ct --limit=100` +// `just govbot source --repos=ct --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ct --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ct --limit=100 --filter=default` // // Current filter: No test data available - placeholder filter // ====================================== diff --git a/actions/govbot/src/filters/de-legislation/default.rs b/actions/govbot/src/filters/de-legislation/default.rs index f9fa1cf5..3e991a07 100644 --- a/actions/govbot/src/filters/de-legislation/default.rs +++ b/actions/govbot/src/filters/de-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR de-legislation (Delaware): // 1. First, gather real data by running this command: -// `just govbot logs --repos=de --limit=100` +// `just govbot source --repos=de --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=de --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=de --limit=100 --filter=default` // // Current filter removes: routine introductions, committee assignments, and "Not Worked" status updates // ====================================== diff --git a/actions/govbot/src/filters/fl-legislation/default.rs b/actions/govbot/src/filters/fl-legislation/default.rs index 7006a46b..164b5ee0 100644 --- a/actions/govbot/src/filters/fl-legislation/default.rs +++ b/actions/govbot/src/filters/fl-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR fl-legislation (Florida): // 1. First, gather real data by running this command: -// `just govbot logs --repos=fl --limit=100` +// `just govbot source --repos=fl --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=fl --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=fl --limit=100 --filter=default` // // Current filter removes: routine filings, referrals, and committee status updates // ====================================== diff --git a/actions/govbot/src/filters/ga-legislation/default.rs b/actions/govbot/src/filters/ga-legislation/default.rs index 3ea1da95..611c3adf 100644 --- a/actions/govbot/src/filters/ga-legislation/default.rs +++ b/actions/govbot/src/filters/ga-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ga-legislation (Georgia): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ga --limit=100` +// `just govbot source --repos=ga --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=ga --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ga --limit=100 --filter=default` // // Current filter removes: routine hopper entries, first readers, and routine referrals // ====================================== diff --git a/actions/govbot/src/filters/gu-legislation/default.rs b/actions/govbot/src/filters/gu-legislation/default.rs index 6b4156f3..1de32ba1 100644 --- a/actions/govbot/src/filters/gu-legislation/default.rs +++ b/actions/govbot/src/filters/gu-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR gu-legislation (Guam): // 1. First, gather real data by running this command: -// `just govbot logs --repos=gu --limit=100` +// `just govbot source --repos=gu --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=gu --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=gu --limit=100 --filter=default` // // Current filter: No test data available - placeholder filter // ====================================== diff --git a/actions/govbot/src/filters/hi-legislation/default.rs b/actions/govbot/src/filters/hi-legislation/default.rs index ea7af414..815f8e79 100644 --- a/actions/govbot/src/filters/hi-legislation/default.rs +++ b/actions/govbot/src/filters/hi-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR hi-legislation (Hawaii): // 1. First, gather real data by running this command: -// `just govbot logs --repos=hi --limit=100` +// `just govbot source --repos=hi --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=hi --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=hi --limit=100 --filter=default` // // Current filter removes: routine introductions, first readings, and committee referral patterns // ====================================== diff --git a/actions/govbot/src/filters/ia-legislation/default.rs b/actions/govbot/src/filters/ia-legislation/default.rs index 4c180fcc..05729144 100644 --- a/actions/govbot/src/filters/ia-legislation/default.rs +++ b/actions/govbot/src/filters/ia-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ia-legislation (Iowa): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=ia --limit=100` +// `just govbot source --repos=ia --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ia --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ia --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and subcommittee notifications // ====================================== diff --git a/actions/govbot/src/filters/id-legislation/default.rs b/actions/govbot/src/filters/id-legislation/default.rs index da05b8e2..5668ef6c 100644 --- a/actions/govbot/src/filters/id-legislation/default.rs +++ b/actions/govbot/src/filters/id-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR id-legislation (Idaho): // 1. First, gather real data by running this command: -// `just govbot logs --repos=id --limit=100` +// `just govbot source --repos=id --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=id --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=id --limit=100 --filter=default` // // Current filter removes: routine introductions, readings, and status updates // ====================================== diff --git a/actions/govbot/src/filters/il-legislation/default.rs b/actions/govbot/src/filters/il-legislation/default.rs index 43184df3..d1892ef4 100644 --- a/actions/govbot/src/filters/il-legislation/default.rs +++ b/actions/govbot/src/filters/il-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR il-legislation (Illinois): // 1. First, gather real data by running this command: -// `just govbot logs --repos=il --limit=100` +// `just govbot source --repos=il --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=il --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=il --limit=100 --filter=default` // // Current filter removes: routine co-sponsor additions, Rules Committee referrals, and filings // ====================================== diff --git a/actions/govbot/src/filters/in-legislation/default.rs b/actions/govbot/src/filters/in-legislation/default.rs index fc539141..8d2342cb 100644 --- a/actions/govbot/src/filters/in-legislation/default.rs +++ b/actions/govbot/src/filters/in-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR in-legislation (Indiana): // 1. First, gather real data by running this command: -// `just govbot logs --repos=in --limit=100` +// `just govbot source --repos=in --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=in --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=in --limit=100 --filter=default` // // Current filter removes: routine first readings, referrals, and authorship notifications // ====================================== diff --git a/actions/govbot/src/filters/ks-legislation/default.rs b/actions/govbot/src/filters/ks-legislation/default.rs index 955526db..9255ddbf 100644 --- a/actions/govbot/src/filters/ks-legislation/default.rs +++ b/actions/govbot/src/filters/ks-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ks-legislation (Kansas): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ks --limit=100` +// `just govbot source --repos=ks --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ks --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ks --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and hearing notifications // ====================================== diff --git a/actions/govbot/src/filters/ky-legislation/default.rs b/actions/govbot/src/filters/ky-legislation/default.rs index 3e1e83a6..6005bb88 100644 --- a/actions/govbot/src/filters/ky-legislation/default.rs +++ b/actions/govbot/src/filters/ky-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ky-legislation (Kentucky): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=ky --limit=100` +// `just govbot source --repos=ky --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=ky --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ky --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/filters/la-legislation/default.rs b/actions/govbot/src/filters/la-legislation/default.rs index eac82569..7f01a69e 100644 --- a/actions/govbot/src/filters/la-legislation/default.rs +++ b/actions/govbot/src/filters/la-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR la-legislation (Louisiana): // 1. First, gather real data by running this command: -// `just govbot logs --repos=la --limit=100` +// `just govbot source --repos=la --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=la --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=la --limit=100 --filter=default` // // Current filter removes: routine prefiling, referrals, and reading actions // ====================================== diff --git a/actions/govbot/src/filters/ma-legislation/default.rs b/actions/govbot/src/filters/ma-legislation/default.rs index f5f5d87b..95b8b1bc 100644 --- a/actions/govbot/src/filters/ma-legislation/default.rs +++ b/actions/govbot/src/filters/ma-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR ma-legislation (Massachusetts): // 1. First, gather real data by running this command: -// `just govbot logs --repos=ma --limit=100` +// `just govbot source --repos=ma --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=ma --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=ma --limit=100 --filter=default` // // Current filter removes: routine hearing scheduling, concurrences, and referrals // ====================================== diff --git a/actions/govbot/src/filters/md-legislation/default.rs b/actions/govbot/src/filters/md-legislation/default.rs index 2a8465c7..ac50ac1b 100644 --- a/actions/govbot/src/filters/md-legislation/default.rs +++ b/actions/govbot/src/filters/md-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR md-legislation (Maryland): // 1. First, gather real data by running this command: -// `just govbot logs --repos=md --limit=100` +// `just govbot source --repos=md --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=md --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=md --limit=100 --filter=default` // // Current filter removes: routine pre-filings, first readings, and hearing notifications // ====================================== diff --git a/actions/govbot/src/filters/me-legislation/default.rs b/actions/govbot/src/filters/me-legislation/default.rs index 27244f4e..922cf2d4 100644 --- a/actions/govbot/src/filters/me-legislation/default.rs +++ b/actions/govbot/src/filters/me-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR me-legislation (Maine): // 1. First, gather real data by running this command: -// `just govbot logs --repos=me --limit=100` +// `just govbot source --repos=me --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=me --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=me --limit=100 --filter=default` // // Current filter removes: routine referrals, author additions, and status updates // ====================================== diff --git a/actions/govbot/src/filters/mi-legislation/default.rs b/actions/govbot/src/filters/mi-legislation/default.rs index 8c97cfd6..266bebc7 100644 --- a/actions/govbot/src/filters/mi-legislation/default.rs +++ b/actions/govbot/src/filters/mi-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mi-legislation (Michigan): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=mi --limit=100` +// `just govbot source --repos=mi --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=mi --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=mi --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and first readings // ====================================== diff --git a/actions/govbot/src/filters/mn-legislation/default.rs b/actions/govbot/src/filters/mn-legislation/default.rs index bc7d5ffe..2595fec1 100644 --- a/actions/govbot/src/filters/mn-legislation/default.rs +++ b/actions/govbot/src/filters/mn-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mn-legislation (Minnesota): // 1. First, gather real data by running this command: -// `just govbot logs --repos=mn --limit=100` +// `just govbot source --repos=mn --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=mn --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=mn --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and author additions // ====================================== diff --git a/actions/govbot/src/filters/mo-legislation/default.rs b/actions/govbot/src/filters/mo-legislation/default.rs index 37bdb0d4..c67e4bce 100644 --- a/actions/govbot/src/filters/mo-legislation/default.rs +++ b/actions/govbot/src/filters/mo-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mo-legislation (Missouri): // 1. First, gather real data by running this command: -// `just govbot logs --repos=mo --limit=100` +// `just govbot source --repos=mo --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=mo --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=mo --limit=100 --filter=default` // // Current filter removes: routine prefiling actions // ====================================== diff --git a/actions/govbot/src/filters/mp-legislation/default.rs b/actions/govbot/src/filters/mp-legislation/default.rs index d69f7119..44c86204 100644 --- a/actions/govbot/src/filters/mp-legislation/default.rs +++ b/actions/govbot/src/filters/mp-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR mp-legislation (Northern Mariana Islands): // 1. First, gather real data by running this command: -// `just govbot logs --repos=mp --limit=100` +// `just govbot source --repos=mp --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=mp --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=mp --limit=100 --filter=default`
 //
 // Current filter removes: routine introduction and reading actions
 // ======================================
diff --git a/actions/govbot/src/filters/ms-legislation/default.rs b/actions/govbot/src/filters/ms-legislation/default.rs
index 105a3154..1cc8e68e 100644
--- a/actions/govbot/src/filters/ms-legislation/default.rs
+++ b/actions/govbot/src/filters/ms-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR ms-legislation (Mississippi):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=ms --limit=100`
+// `just govbot source --repos=ms --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=ms --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=ms --limit=100 --filter=default`
 //
 // Current filter removes: routine referrals and status updates
 // ======================================
diff --git a/actions/govbot/src/filters/mt-legislation/default.rs b/actions/govbot/src/filters/mt-legislation/default.rs
index c1e2dd58..a9b25fdf 100644
--- a/actions/govbot/src/filters/mt-legislation/default.rs
+++ b/actions/govbot/src/filters/mt-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR mt-legislation (Montana):
 // 1.
First, gather real data by running this command:
-// `just govbot logs --repos=mt --limit=100`
+// `just govbot source --repos=mt --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=mt --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=mt --limit=100 --filter=default`
 //
 // Current filter removes: routine hearings, scheduling, and draft status updates
 // ======================================
diff --git a/actions/govbot/src/filters/nc-legislation/default.rs b/actions/govbot/src/filters/nc-legislation/default.rs
index d7036d60..e479fb2a 100644
--- a/actions/govbot/src/filters/nc-legislation/default.rs
+++ b/actions/govbot/src/filters/nc-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR nc-legislation (North Carolina):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=nc --limit=100`
+// `just govbot source --repos=nc --limit=100`
 // 2.
Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=nc --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=nc --limit=100 --filter=default`
 //
 // Current filter removes: routine referrals, readings, and status updates
 // ======================================
diff --git a/actions/govbot/src/filters/nd-legislation/default.rs b/actions/govbot/src/filters/nd-legislation/default.rs
index d804b3a1..49c3b75a 100644
--- a/actions/govbot/src/filters/nd-legislation/default.rs
+++ b/actions/govbot/src/filters/nd-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR nd-legislation (North Dakota):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=nd --limit=100`
+// `just govbot source --repos=nd --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5.
Test your changes: `just govbot logs --repos=nd --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=nd --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, referrals, and committee hearings
 // ======================================
diff --git a/actions/govbot/src/filters/ne-legislation/default.rs b/actions/govbot/src/filters/ne-legislation/default.rs
index 5115aeef..a00381ba 100644
--- a/actions/govbot/src/filters/ne-legislation/default.rs
+++ b/actions/govbot/src/filters/ne-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR ne-legislation (Nebraska):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=ne --limit=100`
+// `just govbot source --repos=ne --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=ne --limit=100 --filter=default`
+// 5.
Test your changes: `just govbot source --repos=ne --limit=100 --filter=default`
 //
 // Current filter removes: routine referrals, hearing notifications, and filing actions
 // ======================================
diff --git a/actions/govbot/src/filters/nh-legislation/default.rs b/actions/govbot/src/filters/nh-legislation/default.rs
index d33675c1..018b7344 100644
--- a/actions/govbot/src/filters/nh-legislation/default.rs
+++ b/actions/govbot/src/filters/nh-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR nh-legislation (New Hampshire):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=nh --limit=100`
+// `just govbot source --repos=nh --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=nh --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=nh --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, referrals, and hearing scheduling
 // ======================================
diff --git a/actions/govbot/src/filters/nj-legislation/default.rs b/actions/govbot/src/filters/nj-legislation/default.rs
index 3948854b..76a55526 100644
--- a/actions/govbot/src/filters/nj-legislation/default.rs
+++ b/actions/govbot/src/filters/nj-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR nj-legislation (New Jersey):
 // 1.
First, gather real data by running this command:
-// `just govbot logs --repos=nj --limit=100`
+// `just govbot source --repos=nj --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=nj --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=nj --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, referrals, and routine status updates
 // ======================================
diff --git a/actions/govbot/src/filters/nm-legislation/default.rs b/actions/govbot/src/filters/nm-legislation/default.rs
index b723ddc6..5b716ad4 100644
--- a/actions/govbot/src/filters/nm-legislation/default.rs
+++ b/actions/govbot/src/filters/nm-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR nm-legislation (New Mexico):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=nm --limit=100`
+// `just govbot source --repos=nm --limit=100`
 // 2.
Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=nm --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=nm --limit=100 --filter=default`
 //
 // Current filter removes: routine committee referrals and routine committee actions
 // ======================================
diff --git a/actions/govbot/src/filters/nv-legislation/default.rs b/actions/govbot/src/filters/nv-legislation/default.rs
index 84d8f8a4..fb92075e 100644
--- a/actions/govbot/src/filters/nv-legislation/default.rs
+++ b/actions/govbot/src/filters/nv-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR nv-legislation (Nevada):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=nv --limit=100`
+// `just govbot source --repos=nv --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5.
Test your changes: `just govbot logs --repos=nv --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=nv --limit=100 --filter=default`
 //
 // Current filter removes: routine prefiling, referrals, and status updates
 // ======================================
diff --git a/actions/govbot/src/filters/ny-legislation/default.rs b/actions/govbot/src/filters/ny-legislation/default.rs
index 35b741f9..e32b8edd 100644
--- a/actions/govbot/src/filters/ny-legislation/default.rs
+++ b/actions/govbot/src/filters/ny-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR ny-legislation (New York):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=ny --limit=100`
+// `just govbot source --repos=ny --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=ny --limit=100 --filter=default`
+// 5.
Test your changes: `just govbot source --repos=ny --limit=100 --filter=default`
 //
 // Current filter removes: routine referrals, introductions, and status updates
 // ======================================
diff --git a/actions/govbot/src/filters/oh-legislation/default.rs b/actions/govbot/src/filters/oh-legislation/default.rs
index 7b5a2921..886b4688 100644
--- a/actions/govbot/src/filters/oh-legislation/default.rs
+++ b/actions/govbot/src/filters/oh-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR oh-legislation (Ohio):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=oh --limit=100`
+// `just govbot source --repos=oh --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=oh --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=oh --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, referrals, and status updates
 // ======================================
diff --git a/actions/govbot/src/filters/ok-legislation/default.rs b/actions/govbot/src/filters/ok-legislation/default.rs
index 19f1ad33..ca2b6fc4 100644
--- a/actions/govbot/src/filters/ok-legislation/default.rs
+++ b/actions/govbot/src/filters/ok-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR ok-legislation (Oklahoma):
 // 1.
First, gather real data by running this command:
-// `just govbot logs --repos=ok --limit=100`
+// `just govbot source --repos=ok --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=ok --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=ok --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, readings, referrals, and status updates
 // ======================================
diff --git a/actions/govbot/src/filters/or-legislation/default.rs b/actions/govbot/src/filters/or-legislation/default.rs
index 24599c51..fdf051b1 100644
--- a/actions/govbot/src/filters/or-legislation/default.rs
+++ b/actions/govbot/src/filters/or-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR or-legislation (Oregon):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=or --limit=100`
+// `just govbot source --repos=or --limit=100`
 // 2.
Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=or --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=or --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, referrals, and calendar scheduling
 // ======================================
diff --git a/actions/govbot/src/filters/pa-legislation/default.rs b/actions/govbot/src/filters/pa-legislation/default.rs
index 44b820e5..87c7f1b2 100644
--- a/actions/govbot/src/filters/pa-legislation/default.rs
+++ b/actions/govbot/src/filters/pa-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR pa-legislation (Pennsylvania):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=pa --limit=100`
+// `just govbot source --repos=pa --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5.
Test your changes: `just govbot logs --repos=pa --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=pa --limit=100 --filter=default`
 //
 // Current filter removes: routine referrals, readings, and status updates
 // ======================================
diff --git a/actions/govbot/src/filters/pr-legislation/default.rs b/actions/govbot/src/filters/pr-legislation/default.rs
index 932a3d0a..badaca7d 100644
--- a/actions/govbot/src/filters/pr-legislation/default.rs
+++ b/actions/govbot/src/filters/pr-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR pr-legislation (Puerto Rico):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=pr --limit=100`
+// `just govbot source --repos=pr --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=pr --limit=100 --filter=default`
+// 5.
Test your changes: `just govbot source --repos=pr --limit=100 --filter=default`
 //
 // Current filter removes: routine referrals, introductions, and calendar scheduling
 // ======================================
diff --git a/actions/govbot/src/filters/ri-legislation/default.rs b/actions/govbot/src/filters/ri-legislation/default.rs
index 1c84699b..dcae2fe6 100644
--- a/actions/govbot/src/filters/ri-legislation/default.rs
+++ b/actions/govbot/src/filters/ri-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR ri-legislation (Rhode Island):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=ri --limit=100`
+// `just govbot source --repos=ri --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=ri --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=ri --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, referrals, and scheduling
 // ======================================
diff --git a/actions/govbot/src/filters/sc-legislation/default.rs b/actions/govbot/src/filters/sc-legislation/default.rs
index e0c2c3e7..e4503522 100644
--- a/actions/govbot/src/filters/sc-legislation/default.rs
+++ b/actions/govbot/src/filters/sc-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR sc-legislation (South Carolina):
 // 1.
First, gather real data by running this command:
-// `just govbot logs --repos=sc --limit=100`
+// `just govbot source --repos=sc --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=sc --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=sc --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, referrals, filings, and readings
 // ======================================
diff --git a/actions/govbot/src/filters/sd-legislation/default.rs b/actions/govbot/src/filters/sd-legislation/default.rs
index de9ce815..352b8732 100644
--- a/actions/govbot/src/filters/sd-legislation/default.rs
+++ b/actions/govbot/src/filters/sd-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR sd-legislation (South Dakota):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=sd --limit=100`
+// `just govbot source --repos=sd --limit=100`
 // 2.
Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=sd --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=sd --limit=100 --filter=default`
 //
 // Current filter removes: routine introductions, referrals, and status updates
 // ======================================
diff --git a/actions/govbot/src/filters/tn-legislation/default.rs b/actions/govbot/src/filters/tn-legislation/default.rs
index e57a52ea..b477ca6b 100644
--- a/actions/govbot/src/filters/tn-legislation/default.rs
+++ b/actions/govbot/src/filters/tn-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR tn-legislation (Tennessee):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=tn --limit=100`
+// `just govbot source --repos=tn --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5.
Test your changes: `just govbot logs --repos=tn --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=tn --limit=100 --filter=default`
 //
 // Current filter removes: routine filings, introductions, referrals, and calendar scheduling
 // ======================================
diff --git a/actions/govbot/src/filters/tx-legislation/default.rs b/actions/govbot/src/filters/tx-legislation/default.rs
index abba88b2..55d8095e 100644
--- a/actions/govbot/src/filters/tx-legislation/default.rs
+++ b/actions/govbot/src/filters/tx-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR tx-legislation (Texas):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=tx --limit=100`
+// `just govbot source --repos=tx --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=tx --limit=100 --filter=default`
+// 5.
Test your changes: `just govbot source --repos=tx --limit=100 --filter=default`
 //
 // Current filter: No test data available - placeholder filter
 // ======================================
diff --git a/actions/govbot/src/filters/usa-legislation/default.rs b/actions/govbot/src/filters/usa-legislation/default.rs
index 9b2c63b4..5faedcba 100644
--- a/actions/govbot/src/filters/usa-legislation/default.rs
+++ b/actions/govbot/src/filters/usa-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR usa-legislation (United States):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=usa --limit=100`
+// `just govbot source --repos=usa --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,13 +15,13 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=usa --limit=100 --filter=default`
+// 5.
Test your changes: `just govbot source --repos=usa --limit=100 --filter=default`
 //
 // Current filter: TODO - Analyze output to identify noisy patterns
 // ======================================
 // Filter for usa-legislation (United States)
-// TODO: Analyze output from `just govbot logs --limit=10` to identify noisy patterns
+// TODO: Analyze output from `just govbot source --limit=10` to identify noisy patterns
 // and add specific filters for this locale
 use crate::filter::FilterResult;
diff --git a/actions/govbot/src/filters/ut-legislation/default.rs b/actions/govbot/src/filters/ut-legislation/default.rs
index 6aba440e..11ff6475 100644
--- a/actions/govbot/src/filters/ut-legislation/default.rs
+++ b/actions/govbot/src/filters/ut-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR ut-legislation (Utah):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=ut --limit=100`
+// `just govbot source --repos=ut --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=ut --limit=100 --filter=default`
+// 5.
Test your changes: `just govbot source --repos=ut --limit=100 --filter=default`
 //
 // Current filter removes: routine readings, status updates, and transfers
 // ======================================
diff --git a/actions/govbot/src/filters/va-legislation/default.rs b/actions/govbot/src/filters/va-legislation/default.rs
index 089d2dc1..561dc239 100644
--- a/actions/govbot/src/filters/va-legislation/default.rs
+++ b/actions/govbot/src/filters/va-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR va-legislation (Virginia):
 // 1. First, gather real data by running this command:
-// `just govbot logs --repos=va --limit=100`
+// `just govbot source --repos=va --limit=100`
 // 2. Analyze the output to identify patterns that are routine/noteworthy but not important:
 // - Routine actions: committee referrals, first readings, filings, prefiling, status updates
 // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance
@@ -15,7 +15,7 @@
 // - Check `classification` array for routine classifications
 // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match)
 // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones
-// 5. Test your changes: `just govbot logs --repos=va --limit=100 --filter=default`
+// 5. Test your changes: `just govbot source --repos=va --limit=100 --filter=default`
 //
 // Current filter: No test data available - placeholder filter
 // ======================================
diff --git a/actions/govbot/src/filters/vi-legislation/default.rs b/actions/govbot/src/filters/vi-legislation/default.rs
index 2ede5768..8faf3efd 100644
--- a/actions/govbot/src/filters/vi-legislation/default.rs
+++ b/actions/govbot/src/filters/vi-legislation/default.rs
@@ -4,7 +4,7 @@
 //
 // TO UPDATE THIS FILTER FOR vi-legislation (U.S. Virgin Islands):
 // 1.
First, gather real data by running this command: -// `just govbot logs --repos=vi --limit=100` +// `just govbot source --repos=vi --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=vi --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=vi --limit=100 --filter=default` // // Current filter removes: routine status updates and transfers // ====================================== diff --git a/actions/govbot/src/filters/vt-legislation/default.rs b/actions/govbot/src/filters/vt-legislation/default.rs index 78e6a091..61256700 100644 --- a/actions/govbot/src/filters/vt-legislation/default.rs +++ b/actions/govbot/src/filters/vt-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR vt-legislation (Vermont): // 1. First, gather real data by running this command: -// `just govbot logs --repos=vt --limit=100` +// `just govbot source --repos=vt --limit=100` // 2. 
Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=vt --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=vt --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and calendar scheduling // ====================================== diff --git a/actions/govbot/src/filters/wa-legislation/default.rs b/actions/govbot/src/filters/wa-legislation/default.rs index ff086464..3879a7dc 100644 --- a/actions/govbot/src/filters/wa-legislation/default.rs +++ b/actions/govbot/src/filters/wa-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR wa-legislation (Washington): // 1. First, gather real data by running this command: -// `just govbot logs --repos=wa --limit=100` +// `just govbot source --repos=wa --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. 
Test your changes: `just govbot logs --repos=wa --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=wa --limit=100 --filter=default` // // Current filter removes: routine readings, referrals, and scheduling // ====================================== diff --git a/actions/govbot/src/filters/wi-legislation/default.rs b/actions/govbot/src/filters/wi-legislation/default.rs index 6c04c657..d0915f04 100644 --- a/actions/govbot/src/filters/wi-legislation/default.rs +++ b/actions/govbot/src/filters/wi-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR wi-legislation (Wisconsin): // 1. First, gather real data by running this command: -// `just govbot logs --repos=wi --limit=100` +// `just govbot source --repos=wi --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=wi --limit=100 --filter=default` +// 5. 
Test your changes: `just govbot source --repos=wi --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/filters/wv-legislation/default.rs b/actions/govbot/src/filters/wv-legislation/default.rs index 996440ad..3a77365b 100644 --- a/actions/govbot/src/filters/wv-legislation/default.rs +++ b/actions/govbot/src/filters/wv-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR wv-legislation (West Virginia): // 1. First, gather real data by running this command: -// `just govbot logs --repos=wv --limit=100` +// `just govbot source --repos=wv --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=wv --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=wv --limit=100 --filter=default` // // Current filter removes: routine filings, introductions, referrals, and readings // ====================================== diff --git a/actions/govbot/src/filters/wy-legislation/default.rs b/actions/govbot/src/filters/wy-legislation/default.rs index 1a779daf..3ddba8d9 100644 --- a/actions/govbot/src/filters/wy-legislation/default.rs +++ b/actions/govbot/src/filters/wy-legislation/default.rs @@ -4,7 +4,7 @@ // // TO UPDATE THIS FILTER FOR wy-legislation (Wyoming): // 1. 
First, gather real data by running this command: -// `just govbot logs --repos=wy --limit=100` +// `just govbot source --repos=wy --limit=100` // 2. Analyze the output to identify patterns that are routine/noteworthy but not important: // - Routine actions: committee referrals, first readings, filings, prefiling, status updates // - Important actions: passage votes, executive signatures, amendments, failures, committee reports with substance @@ -15,7 +15,7 @@ // - Check `classification` array for routine classifications // - Check `description` string for routine text patterns (use `starts_with()`, `contains()`, or exact match) // - Return `FilterResult::FilterOut` for routine entries, `FilterResult::Keep` for important ones -// 5. Test your changes: `just govbot logs --repos=wy --limit=100 --filter=default` +// 5. Test your changes: `just govbot source --repos=wy --limit=100 --filter=default` // // Current filter removes: routine introductions, referrals, and status updates // ====================================== diff --git a/actions/govbot/src/git.rs b/actions/govbot/src/git.rs index de6c768d..d76f1b26 100644 --- a/actions/govbot/src/git.rs +++ b/actions/govbot/src/git.rs @@ -1,94 +1,32 @@ use crate::error::{Error, Result}; +use crate::registry::ResolvedDataset; use git2::{build::RepoBuilder, FetchOptions, RemoteCallbacks, Repository}; use std::fs; use std::path::{Path, PathBuf}; -// Repository URL template - fully configurable for any git hosting service +// ============================================================ +// Dataset git operations. // -// This template uses {locale} as a placeholder that will be replaced with the actual locale. -// You can configure it via the GOVBOT_REPO_URL_TEMPLATE environment variable. 
-// -// Examples: -// - GitHub: https://github.com/org/{locale}-suffix.git -// - GitLab: https://gitlab.com/org/{locale}-suffix.git -// - Bitbucket: https://bitbucket.org/org/{locale}-suffix.git -// - Self-hosted GitLab: https://git.example.com/group/{locale}-repo.git -// - Self-hosted Gitea: https://gitea.example.com/org/{locale}-data.git -// -// To use a custom URL template, set the environment variable: -// export GOVBOT_REPO_URL_TEMPLATE="https://gitlab.com/myorg/{locale}-data.git" -const DEFAULT_REPO_URL_TEMPLATE: &str = - "https://github.com/chn-openstates-files/{locale}-legislation.git"; - -/// Get the repository URL template from environment or use default -fn get_repo_url_template() -> String { - std::env::var("GOVBOT_REPO_URL_TEMPLATE") - .unwrap_or_else(|_| DEFAULT_REPO_URL_TEMPLATE.to_string()) -} - -/// Build the clone URL for a repository -pub fn build_clone_url(locale: &str) -> String { - let template = get_repo_url_template(); - template.replace("{locale}", locale) -} - -/// Extract repository name from URL template -/// For example: "https://github.com/org/{locale}-suffix.git" -> "{locale}-suffix" -fn extract_repo_name_pattern(template: &str) -> String { - // Extract the part after the last / and before .git - if let Some(start) = template.rfind('/') { - let after_slash = &template[start + 1..]; - if let Some(end) = after_slash.rfind(".git") { - after_slash[..end].to_string() - } else { - // No .git extension, just take after last / - after_slash.to_string() - } - } else { - // Fallback: return template as-is (might be just the pattern) - template.to_string() - } +// Datasets are git repos. Their URLs are NOT derived from a compiled locale +// enum or a `{locale}` URL template anymore — they are looked up at runtime in +// the dataset *registry* (`registry.rs`). A dataset is cloned ONCE per machine +// into the shared content-addressed cache (`cache.rs`); a project's +// `.govbot/repos/` is a symlink into that cache. 
+// ============================================================
+
+/// The local directory name a dataset's clone is stored under, within a
+/// project's `repos/` directory. This is the dataset's short (slash-free)
+/// name plus the legacy `-legislation` data-repo suffix, so existing on-disk
+/// layouts and downstream walkers (`source`, `load`) are unchanged.
+///
+/// `wy` → `wy-legislation`. The suffix is overridable for tests/mocks via
+/// `GOVBOT_REPO_SUFFIX` (the mock data uses `-data-pipeline`).
+pub fn repo_dir_name(short_name: &str) -> String {
+    let suffix = std::env::var("GOVBOT_REPO_SUFFIX").unwrap_or_else(|_| "-legislation".to_string());
+    format!("{}{}", short_name, suffix)
 }

-/// Extract organization/group from URL template
-/// For example: "https://github.com/org/{locale}-suffix.git" -> "org"
-fn extract_repo_org(template: &str) -> String {
-    // Extract the part between domain and repository name
-    // Format: https://domain.com/org/{locale}-suffix.git
-    if let Some(protocol_pos) = template.find("://") {
-        let after_protocol = &template[protocol_pos + 3..];
-        if let Some(domain_end) = after_protocol.find('/') {
-            let after_domain = &after_protocol[domain_end + 1..];
-            // Find the next / which should be before the repo name
-            if let Some(org_end) = after_domain.find('/') {
-                return after_domain[..org_end].to_string();
-            }
-            // If no second /, the whole thing might be the org (unlikely but handle it)
-            if let Some(repo_start) = after_domain.find('{') {
-                return after_domain[..repo_start].trim_end_matches('/').to_string();
-            }
-        }
-    }
-    // Fallback: return default org
-    "chn-openstates-files".to_string()
-}
-
-/// Build the repository name (used for local directory names)
-pub fn build_repo_name(locale: &str) -> String {
-    let template = get_repo_url_template();
-    let pattern = extract_repo_name_pattern(&template);
-    pattern.replace("{locale}", locale)
-}
-
-/// Build the repository path (org/repo-name format, used for display)
-pub fn build_repo_path(locale: &str) -> String {
-    let template = get_repo_url_template();
-    let org = extract_repo_org(&template);
-    let repo_name = build_repo_name(locale);
-    format!("{}/{}", org, repo_name)
-}
-
-/// Get the default repos directory: $CWD/.govbot/repos
+/// Get the default repos directory: `$CWD/.govbot/repos`.
 pub fn default_repos_dir() -> Result<PathBuf> {
     let cwd = std::env::current_dir()
         .map_err(|_| Error::Config("Could not determine current working directory.".to_string()))?;
@@ -96,6 +34,17 @@ pub fn default_repos_dir() -> Result<PathBuf> {
     Ok(cwd.join(".govbot").join("repos"))
 }

+/// The outcome of a clone/pull, plus the commit it landed on.
+#[derive(Debug, Clone)]
+pub struct PullOutcome {
+    /// `"clone"`, `"pulled"`, `"no_updates"`, or `"recloned"`.
+    pub action: &'static str,
+    /// The commit SHA the dataset is now checked out at.
+    pub commit: String,
+    /// The shared-cache key the dataset's clone lives under.
+    pub cache_key: String,
+}
+
 /// Build callbacks for git operations with optional token authentication
 fn build_callbacks(token: Option<&str>, show_progress: bool) -> RemoteCallbacks<'_> {
     let mut callbacks = RemoteCallbacks::new();
@@ -144,114 +93,168 @@ fn build_callbacks(token: Option<&str>, show_progress: bool) -> RemoteCallbacks<
     callbacks
 }

-/// Clone or pull a repository for a given locale with quiet option
-/// Returns action: "clone", "pulled", or "no_updates"
-pub fn clone_or_pull_repo_quiet(
-    locale: &str,
+/// Read the commit SHA `HEAD` currently resolves to in an open repository.
+fn head_commit(repo: &Repository) -> Result<String> {
+    let head = repo
+        .head()
+        .map_err(|e| Error::Config(format!("Failed to read HEAD: {}", e)))?;
+    let oid = head
+        .target()
+        .ok_or_else(|| Error::Config("HEAD has no commit target".to_string()))?;
+    Ok(oid.to_string())
+}
+
+/// Clone-or-pull a dataset into the shared content-addressed cache, then link
+/// the cache entry into the project's `repos/` directory.
+///
+/// This is the registry-driven replacement for the old locale-keyed
+/// `clone_or_pull_repo_quiet`. It:
+/// 1. resolves the dataset's cache key (URL + channel),
+/// 2. clones into `~/.govbot/cache/` once, or `git pull`s it if present,
+/// 3. symlinks `/` to that cache entry,
+/// 4. returns the action taken plus the resolved commit SHA (for the lock).
+///
+/// A second `pull` of the same dataset — in this or any other project — finds
+/// the cache populated and only fetches deltas.
+pub fn clone_or_pull_dataset(
+    dataset: &ResolvedDataset,
     repos_dir: &Path,
     token: Option<&str>,
     quiet: bool,
-) -> Result<&'static str> {
-    let clone_url = build_clone_url(locale);
-    let repo_name = build_repo_name(locale);
-    let repo_path = build_repo_path(locale);
-    let target_dir = repos_dir.join(&repo_name);
-    let mut is_reclone = false;
+) -> Result<PullOutcome> {
+    let short = dataset.short_name();
+    let git_url = &dataset.entry.git_url;
+    let channel = dataset.channel.as_deref();

-    // Check if repository already exists
-    if target_dir.exists() && Repository::open(&target_dir).is_ok() {
-        // Repository exists, pull instead
-        let repo = Repository::open(&target_dir)
-            .map_err(|e| Error::Config(format!("Failed to open repository: {}", e)))?;
+    let cache_entry = crate::cache::cache_path(short, git_url, channel)?;
+    let cache_key = crate::cache::cache_key(short, git_url, channel);

-        // Pull the latest changes (credentials will be used if token is provided)
-        match pull_repo_internal(&repo, token, quiet) {
-            Ok(had_updates) => {
-                // Explicitly drop the repository to ensure all file handles are closed
-                drop(repo);
-
-                // Give the file system a moment to release all locks
-                std::thread::sleep(std::time::Duration::from_millis(50));
+    let mut is_reclone = false;

-                return Ok(if had_updates { "pulled" } else { "no_updates" });
-            }
-            Err(e) => {
-                // Check if this is a merge analysis error
-                let error_msg = e.to_string();
-                if error_msg.contains("Failed to analyze merge")
-                    ||
error_msg.contains("object not found") - { - // Close the repository first + let outcome_action: &'static str = + if cache_entry.exists() && Repository::open(&cache_entry).is_ok() { + // Cached already — pull deltas. + let repo = Repository::open(&cache_entry) + .map_err(|e| Error::Config(format!("Failed to open cached repository: {}", e)))?; + match pull_repo_internal(&repo, token, quiet) { + Ok(had_updates) => { drop(repo); - - // Delete the corrupted repository and reclone - if !quiet { - eprintln!( - "Merge analysis failed, deleting and recloning {}...", - repo_name - ); + std::thread::sleep(std::time::Duration::from_millis(50)); + if had_updates { + "pulled" + } else { + "no_updates" + } + } + Err(e) => { + let error_msg = e.to_string(); + if error_msg.contains("Failed to analyze merge") + || error_msg.contains("object not found") + { + drop(repo); + if !quiet { + eprintln!("Merge analysis failed, deleting and recloning {}...", short); + } + remove_dir_all_robust(&cache_entry).map_err(|e| { + Error::Config(format!("Failed to clear corrupt cache entry: {}", e)) + })?; + is_reclone = true; + // fall through to clone + "" + } else { + drop(repo); + return Err(e); } - - // Delete the repository - delete_repo(locale, repos_dir)?; - - // Mark that we're doing a reclone - is_reclone = true; - - // Now fall through to clone it fresh - } else { - // For other errors, close repo and return the error - drop(repo); - return Err(e); } } - } + } else { + "" + }; + + // If the cache entry is populated and we already pulled, we are done with + // the heavy step — just link and report. 
+ if !outcome_action.is_empty() { + link_dataset(&cache_entry, repos_dir, short)?; + let repo = Repository::open(&cache_entry) + .map_err(|e| Error::Config(format!("Failed to reopen cached repository: {}", e)))?; + let commit = head_commit(&repo)?; + return Ok(PullOutcome { + action: outcome_action, + commit, + cache_key, + }); } - // Remove existing directory if it exists (but is not a git repo) - if target_dir.exists() { - if !quiet { - eprintln!("Removing existing directory: {}", target_dir.display()); - } - std::fs::remove_dir_all(&target_dir)?; + // Clone into the cache. + if cache_entry.exists() { + // A non-repo directory is squatting the cache slot — clear it. + let _ = std::fs::remove_dir_all(&cache_entry); + } + if let Some(parent) = cache_entry.parent() { + std::fs::create_dir_all(parent)?; } - - // Repository doesn't exist, clone it let mut fetch_options = FetchOptions::new(); - // Use a reasonable depth (50 commits) instead of depth=1 - // This provides enough history for merge analysis while still being faster than full clone - // 50 commits is typically enough for several weeks/months of history + // A 50-commit depth: enough history for merge analysis, faster than a + // full clone. 
fetch_options.depth(50); fetch_options.remote_callbacks(build_callbacks(token, !quiet)); let mut builder = RepoBuilder::new(); builder.fetch_options(fetch_options); + if let Some(channel) = channel { + builder.branch(channel); + } - builder.clone(&clone_url, &target_dir).map_err(|e| { - Error::Config(format!( - "Failed to shallow clone repository {}: {}", - repo_path, e - )) - })?; + builder + .clone(git_url, &cache_entry) + .map_err(|e| Error::Config(format!("Failed to clone dataset {}: {}", dataset.id, e)))?; - // After cloning, check if we need to set HEAD to main or master - let repo = Repository::open(&target_dir) + let repo = Repository::open(&cache_entry) .map_err(|e| Error::Config(format!("Failed to open cloned repository: {}", e)))?; - // Try to find the default branch (main or master) - // Check local branches first - let default_branch = if repo.find_branch("main", git2::BranchType::Local).is_ok() { - "main" - } else if repo.find_branch("master", git2::BranchType::Local).is_ok() { - "master" - } else { - // Check remote branches - if repo + // Resolve to a sensible default branch (main/master) if no channel given. + if channel.is_none() { + ensure_default_branch(&repo)?; + } + + let commit = head_commit(&repo)?; + drop(repo); + std::thread::sleep(std::time::Duration::from_millis(50)); + + if !quiet { + eprint!( + "\r \r" + ); + } + + link_dataset(&cache_entry, repos_dir, short)?; + + Ok(PullOutcome { + action: if is_reclone { "recloned" } else { "clone" }, + commit, + cache_key, + }) +} + +/// Link a populated cache entry into a project's `repos/` directory under the +/// dataset's `repo_dir_name`. +fn link_dataset(cache_entry: &Path, repos_dir: &Path, short_name: &str) -> Result<()> { + let project_repo = repos_dir.join(repo_dir_name(short_name)); + crate::cache::link_into_project(cache_entry, &project_repo) +} + +/// Ensure a freshly cloned repo's HEAD points at `main` or `master`. 
+fn ensure_default_branch(repo: &Repository) -> Result<()> { + let default_branch = + if repo.find_branch("main", git2::BranchType::Local).is_ok() { + "main" + } else if repo.find_branch("master", git2::BranchType::Local).is_ok() { + "master" + } else if repo .find_branch("origin/main", git2::BranchType::Remote) .is_ok() { - // Create local main branch from remote let remote_branch = repo.find_branch("origin/main", git2::BranchType::Remote)?; let commit = remote_branch.get().target().ok_or_else(|| { Error::Config("Failed to get commit from origin/main".to_string()) @@ -263,7 +266,6 @@ pub fn clone_or_pull_repo_quiet( .find_branch("origin/master", git2::BranchType::Remote) .is_ok() { - // Create local master branch from remote let remote_branch = repo.find_branch("origin/master", git2::BranchType::Remote)?; let commit = remote_branch.get().target().ok_or_else(|| { Error::Config("Failed to get commit from origin/master".to_string()) @@ -275,26 +277,13 @@ pub fn clone_or_pull_repo_quiet( return Err(Error::Config( "Neither 'main' nor 'master' branch found in repository".to_string(), )); - } - }; + }; - // Set HEAD to the default branch if it's not already set correctly - if let Ok(head) = repo.head() { - if let Some(head_name) = head.name() { - if head_name != format!("refs/heads/{}", default_branch) { - // HEAD points to a different branch, update it - repo.set_head(&format!("refs/heads/{}", default_branch)) - .map_err(|e| { - Error::Config(format!("Failed to set HEAD to {}: {}", default_branch, e)) - })?; - repo.checkout_head(Some(git2::build::CheckoutBuilder::default().force())) - .map_err(|e| { - Error::Config(format!("Failed to checkout {}: {}", default_branch, e)) - })?; - } - } - } else { - // HEAD doesn't exist, set it to the default branch + let needs_set = match repo.head() { + Ok(head) => head.name() != Some(&format!("refs/heads/{}", default_branch)[..]), + Err(_) => true, + }; + if needs_set { repo.set_head(&format!("refs/heads/{}", default_branch)) 
.map_err(|e| { Error::Config(format!("Failed to set HEAD to {}: {}", default_branch, e)) @@ -302,44 +291,7 @@ pub fn clone_or_pull_repo_quiet( repo.checkout_head(Some(git2::build::CheckoutBuilder::default().force())) .map_err(|e| Error::Config(format!("Failed to checkout {}: {}", default_branch, e)))?; } - - // Explicitly drop the repository to ensure all file handles are closed - // This is important on macOS where file handles can prevent deletion - drop(repo); - - // Give the file system a moment to release all locks - // This helps on macOS where file handles might not be released immediately - std::thread::sleep(std::time::Duration::from_millis(50)); - - // Clear any progress line - if !quiet { - eprint!( - "\r \r" - ); - } - - // Return "recloned" if we deleted and recloned, otherwise "clone" - Ok(if is_reclone { "recloned" } else { "clone" }) -} - -/// Clone or pull a repository for a given locale (clones if doesn't exist, pulls if it does) -pub fn clone_or_pull_repo(locale: &str, repos_dir: &Path, token: Option<&str>) -> Result<()> { - clone_or_pull_repo_quiet(locale, repos_dir, token, false).map(|_| ()) -} - -/// Clone a repository for a given locale (deprecated - use clone_or_pull_repo) -pub fn clone_repo(locale: &str, repos_dir: &Path, token: Option<&str>) -> Result<()> { - clone_or_pull_repo(locale, repos_dir, token) -} - -/// Clone a repository for a given locale with quiet option (deprecated - use clone_or_pull_repo_quiet) -pub fn clone_repo_quiet( - locale: &str, - repos_dir: &Path, - token: Option<&str>, - quiet: bool, -) -> Result<()> { - clone_or_pull_repo_quiet(locale, repos_dir, token, quiet).map(|_| ()) + Ok(()) } /// Internal function to pull changes from a repository @@ -353,7 +305,8 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re let local_branch_name = head .name() .and_then(|name| name.strip_prefix("refs/heads/")) - .ok_or_else(|| Error::Config("Failed to determine local branch name".to_string()))?; + 
.ok_or_else(|| Error::Config("Failed to determine local branch name".to_string()))? + .to_string(); // Fetch from remote - try both main and master let mut remote = repo @@ -366,100 +319,67 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re let mut fetch_options = FetchOptions::new(); fetch_options.remote_callbacks(build_callbacks(token, !quiet)); - // If it's a shallow repo, we need to fetch more history for merge analysis to work - // The issue is that shallow clones only have 1 commit, so merge_analysis can't find - // the common ancestor. We need to fetch enough history to unshallow the repo. + // If it's a shallow repo, fetch more history so merge analysis can find the + // common ancestor — a shallow clone of 1 commit has none. if is_shallow { - // Fetch all refs to get full history - this unshallows the repository - // This ensures merge_analysis can find the common ancestor between local and remote let all_refs = vec!["+refs/*:refs/remotes/origin/*"]; let _ = remote.fetch(&all_refs, Some(&mut fetch_options), None); } - // Fetch both main and master branches (only fail if both fail) + // Fetch the current branch plus the usual defaults. 
+ let branch_refspec = format!("refs/heads/{0}:refs/remotes/origin/{0}", local_branch_name); let refspecs = vec![ + branch_refspec.as_str(), "refs/heads/main:refs/remotes/origin/main", "refs/heads/master:refs/remotes/origin/master", ]; - // Try to fetch both branches - ignore errors for individual branches let fetch_result = remote.fetch(&refspecs, Some(&mut fetch_options), None); - // If fetch completely fails, return error if fetch_result.is_err() { - // Check if at least one branch exists remotely by trying to find them - let has_main = repo - .find_branch("origin/main", git2::BranchType::Remote) + let has_branch = repo + .find_branch( + &format!("origin/{}", local_branch_name), + git2::BranchType::Remote, + ) .is_ok(); - let has_master = repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok(); - - if !has_main && !has_master { + if !has_branch { return Err(Error::Config( - "Failed to fetch from remote and neither 'main' nor 'master' branch found" - .to_string(), + "Failed to fetch from remote and the tracked branch was not found".to_string(), )); } - // If at least one exists, continue (fetch might have partially succeeded) } - // Determine which remote branch to use based on local branch - // If local is main, use origin/main; if local is master, use origin/master - // Otherwise, prefer main over master - let (remote_branch_name, target_local_branch) = if local_branch_name == "main" { - if repo - .find_branch("origin/main", git2::BranchType::Remote) - .is_ok() - { - ("origin/main", "main") - } else if repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok() - { - ("origin/master", "master") - } else { - return Err(Error::Config( - "Neither 'main' nor 'master' branch found in remote repository".to_string(), - )); - } - } else if local_branch_name == "master" { - if repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok() - { - ("origin/master", "master") - } else if repo - .find_branch("origin/main", 
git2::BranchType::Remote) - .is_ok() - { - ("origin/main", "main") - } else { - return Err(Error::Config( - "Neither 'main' nor 'master' branch found in remote repository".to_string(), - )); - } + // Track the branch we are on; fall back to main/master if it's gone. + let (remote_branch_name, target_local_branch) = if repo + .find_branch( + &format!("origin/{}", local_branch_name), + git2::BranchType::Remote, + ) + .is_ok() + { + ( + format!("origin/{}", local_branch_name), + local_branch_name.clone(), + ) + } else if repo + .find_branch("origin/main", git2::BranchType::Remote) + .is_ok() + { + ("origin/main".to_string(), "main".to_string()) + } else if repo + .find_branch("origin/master", git2::BranchType::Remote) + .is_ok() + { + ("origin/master".to_string(), "master".to_string()) } else { - // Local branch is neither main nor master - prefer main, fallback to master - if repo - .find_branch("origin/main", git2::BranchType::Remote) - .is_ok() - { - ("origin/main", "main") - } else if repo - .find_branch("origin/master", git2::BranchType::Remote) - .is_ok() - { - ("origin/master", "master") - } else { - return Err(Error::Config( - "Neither 'main' nor 'master' branch found in remote repository".to_string(), - )); - } + return Err(Error::Config( + "No tracked branch found in remote repository".to_string(), + )); }; let remote_branch = repo - .find_branch(remote_branch_name, git2::BranchType::Remote) + .find_branch(&remote_branch_name, git2::BranchType::Remote) .map_err(|e| { Error::Config(format!( "Failed to find remote branch {}: {}", @@ -477,13 +397,12 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re // If local branch doesn't match the target, switch to it if local_branch_name != target_local_branch { - // Check if local branch exists, if not create it if repo - .find_branch(target_local_branch, git2::BranchType::Local) + .find_branch(&target_local_branch, git2::BranchType::Local) .is_err() { let commit_obj = 
repo.find_commit(remote_commit)?; - repo.branch(target_local_branch, &commit_obj, false)?; + repo.branch(&target_local_branch, &commit_obj, false)?; } repo.set_head(&format!("refs/heads/{}", target_local_branch)) @@ -504,10 +423,8 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re .map_err(|e| Error::Config(format!("Failed to analyze merge: {}", e)))?; if analysis.0.is_up_to_date() { - // Already up to date - return Ok(false); + Ok(false) } else if analysis.0.is_fast_forward() { - // Fast-forward merge let mut reference = head .resolve() .map_err(|e| Error::Config(format!("Failed to resolve HEAD: {}", e)))?; @@ -518,65 +435,13 @@ fn pull_repo_internal(repo: &Repository, token: Option<&str>, quiet: bool) -> Re .map_err(|e| Error::Config(format!("Failed to set HEAD: {}", e)))?; repo.checkout_head(Some(git2::build::CheckoutBuilder::default().force())) .map_err(|e| Error::Config(format!("Failed to checkout: {}", e)))?; - - // Updates were made - return Ok(true); + Ok(true) } else { - // Need to merge - return Err(Error::Config( + Err(Error::Config( "Repository has diverged and cannot be fast-forwarded. Please resolve manually." .to_string(), - )); - } -} - -/// Pull a repository for a given locale -pub fn pull_repo(locale: &str, repos_dir: &Path, token: Option<&str>) -> Result<()> { - pull_repo_quiet(locale, repos_dir, token, false) -} - -/// Pull a repository for a given locale with quiet option -pub fn pull_repo_quiet( - locale: &str, - repos_dir: &Path, - token: Option<&str>, - quiet: bool, -) -> Result<()> { - let repo_name = build_repo_name(locale); - let repo_path = build_repo_path(locale); - let target_dir = repos_dir.join(&repo_name); - - let repo = match Repository::open(&target_dir) { - Ok(repo) => repo, - Err(_) => { - if !quiet { - eprintln!("Repository does not exist: {}. 
Skipping.", repo_path); - } - return Ok(()); - } - }; - - // Pull the latest changes (credentials will be used if token is provided) - if !quiet { - eprintln!("Pulling repository: {}", repo_path); - } - - pull_repo_internal(&repo, token, quiet)?; - - // Explicitly drop the repository to ensure all file handles are closed - drop(repo); - - // Give the file system a moment to release all locks - std::thread::sleep(std::time::Duration::from_millis(50)); - - // Clear any progress line - if !quiet { - eprint!( - "\r \r" - ); - eprintln!("Successfully pulled {}", repo_path); + )) } - Ok(()) } /// Calculate the size of a directory in bytes @@ -592,7 +457,6 @@ pub fn get_directory_size(path: &Path) -> Result<u64> { if metadata.is_file() { *total += metadata.len(); } else if metadata.is_dir() { - // Recursively calculate size of subdirectories for sub_entry in fs::read_dir(entry.path())? { let sub_entry = sub_entry?; calculate_size(&sub_entry, total)?; @@ -631,125 +495,51 @@ pub fn format_size(bytes: u64) -> String { } } -/// Get estimated remote repository size by doing a lightweight fetch -/// This fetches only refs and estimates size from transfer progress -pub fn get_remote_repo_size_estimate( - repo: &Repository, - token: Option<&str>, - _quiet: bool, -) -> Result<u64> { - use std::sync::{Arc, Mutex}; - - let mut remote = repo - .find_remote("origin") - .map_err(|e| Error::Config(format!("Failed to find remote 'origin': {}", e)))?; - - let size_estimate = Arc::new(Mutex::new(0u64)); - let size_estimate_clone = size_estimate.clone(); - - let mut fetch_options = FetchOptions::new(); - let token = token.map(|t| t.to_string()); - - let mut callbacks = RemoteCallbacks::new(); - callbacks.credentials(move |_url, _username, _allowed| { - if let Some(ref token) = token { - git2::Cred::userpass_plaintext("x-access-token", token) - } else { - git2::Cred::default() - } - }); - - // Track transfer progress to estimate size - callbacks.transfer_progress(move |stats| { - // received_bytes()
gives us the total bytes received so far - let bytes = stats.received_bytes() as u64; - let mut size = size_estimate_clone.lock().unwrap(); - *size = bytes; - true - }); - - fetch_options.remote_callbacks(callbacks); - - // Do a lightweight fetch - fetch refs only, not objects - // This will give us size information without downloading everything - let _fetch_result = remote.fetch( - &["refs/heads/*:refs/remotes/origin/*"], - Some(&mut fetch_options), - None, - ); - - // Even if fetch fails, we might have gotten some size info - let estimated_size = *size_estimate.lock().unwrap(); - - if estimated_size > 0 { - Ok(estimated_size) - } else { - // Fallback: estimate from local pack files if they exist - let pack_dir = repo.path().join("objects").join("pack"); - if pack_dir.exists() { - Ok(get_directory_size(&pack_dir).unwrap_or(0)) - } else { - Ok(0) - } - } -} - -/// Extract suffix from URL template (everything after {locale}) -/// For example: "{locale}-legislation" -> "-legislation" -fn extract_repo_suffix(template: &str) -> String { - let pattern = extract_repo_name_pattern(template); - if let Some(locale_pos) = pattern.find("{locale}") { - // Get everything after {locale} - pattern[locale_pos + 8..].to_string() // 8 is length of "{locale}" - } else { - // Fallback: try common patterns - "-legislation".to_string() - } -} - -/// Get all available locale repositories in the repos directory -pub fn get_available_locales(repos_dir: &Path) -> Result<Vec<String>> { +/// List the datasets locally present in a project's `repos/` directory, +/// returned as short names (the registry/manifest identifier form). +/// +/// A "dataset directory" is any directory (or symlink-to-directory) whose name +/// carries the dataset suffix — it need not be a live git repo, so mock data +/// and non-git extracts are listed too.
+pub fn get_local_datasets(repos_dir: &Path) -> Result<Vec<String>> { if !repos_dir.exists() { return Ok(Vec::new()); } - let template = get_repo_url_template(); - let suffix = extract_repo_suffix(&template); - let mut locales = Vec::new(); + let suffix = std::env::var("GOVBOT_REPO_SUFFIX").unwrap_or_else(|_| "-legislation".to_string()); + let mut datasets = Vec::new(); for entry in std::fs::read_dir(repos_dir)? { let entry = entry?; let path = entry.path(); - if path.is_dir() && Repository::open(&path).is_ok() { + // A symlink into the cache or a real clone — both count. `is_dir()` + // follows a symlink, so a cache symlink resolves correctly. + if path.is_dir() { if let Some(dir_name) = path.file_name().and_then(|n| n.to_str()) { - // Check for current format first, then old format for backward compatibility - if !suffix.is_empty() { - if let Some(locale) = dir_name.strip_suffix(&suffix) { - locales.push(locale.to_string()); - continue; - } + if let Some(short) = dir_name.strip_suffix(&suffix) { + datasets.push(short.to_string()); + continue; } - // Fallback to old format for backward compatibility - if let Some(locale) = dir_name.strip_suffix("-data-pipeline") { - locales.push(locale.to_string()); + // Legacy layout fallback. + if let Some(short) = dir_name.strip_suffix("-data-pipeline") { + datasets.push(short.to_string()); } } } } - - Ok(locales) + datasets.sort(); + Ok(datasets) } -/// Recursively remove a directory and all its contents -/// This is more robust than remove_dir_all on macOS +/// Recursively remove a directory and all its contents. +/// More robust than `remove_dir_all` on macOS.
fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { if !path.exists() { return Ok(()); } - if path.is_file() { - // Make file writable before removing + if path.is_file() || path.is_symlink() { let _ = std::fs::metadata(path).and_then(|m| { use std::os::unix::fs::PermissionsExt; let mut perms = m.permissions(); @@ -759,14 +549,12 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { return std::fs::remove_file(path); } - // For directories, recursively remove contents first let entries: Vec<_> = std::fs::read_dir(path)?.collect(); for entry_result in entries { let entry = entry_result?; let entry_path = entry.path(); - // Make writable before trying to remove let _ = std::fs::metadata(&entry_path).and_then(|m| { use std::os::unix::fs::PermissionsExt; let mut perms = m.permissions(); @@ -775,20 +563,16 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { }); if entry_path.is_dir() { - // Recursively remove subdirectory if remove_dir_all_robust(&entry_path).is_err() { - // If recursive removal fails, try a few more times for _ in 0..3 { std::thread::sleep(std::time::Duration::from_millis(100)); if remove_dir_all_robust(&entry_path).is_ok() { break; } } - // If still failing, try direct removal let _ = std::fs::remove_dir_all(&entry_path); } } else { - // Try to remove file multiple times let mut removed = false; for _ in 0..3 { if std::fs::remove_file(&entry_path).is_ok() { @@ -798,7 +582,6 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { std::thread::sleep(std::time::Duration::from_millis(50)); } if !removed { - // Last resort: try to make it writable again and remove let _ = std::fs::metadata(&entry_path).and_then(|m| { use std::os::unix::fs::PermissionsExt; let mut perms = m.permissions(); @@ -810,7 +593,6 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { } } - // Make directory writable before removing let _ = std::fs::metadata(path).and_then(|m| { use std::os::unix::fs::PermissionsExt; let mut 
perms = m.permissions(); @@ -818,8 +600,6 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { std::fs::set_permissions(path, perms) }); - // Now try to remove the directory itself - // Retry multiple times for macOS let mut last_error = None; for _ in 0..5 { match std::fs::remove_dir(path) { @@ -831,11 +611,9 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { } } - // Final attempt with remove_dir_all match std::fs::remove_dir_all(path) { Ok(_) => Ok(()), Err(e) => { - // Return the more specific error if available if let Some(prev_error) = last_error { Err(prev_error) } else { @@ -845,65 +623,55 @@ fn remove_dir_all_robust(path: &Path) -> std::io::Result<()> { } } -/// Delete a repository for a given locale -pub fn delete_repo(locale: &str, repos_dir: &Path) -> Result<()> { - let repo_name = build_repo_name(locale); - let target_dir = repos_dir.join(&repo_name); +/// Remove a dataset's clone from a project's `repos/` directory. +/// +/// This unlinks the project's reference (the symlink into the shared cache); +/// the cache entry itself is left intact, since other projects may use it. +pub fn delete_dataset(short_name: &str, repos_dir: &Path) -> Result<()> { + let target_dir = repos_dir.join(repo_dir_name(short_name)); - if !target_dir.exists() { - return Ok(()); // Repository doesn't exist, nothing to delete + if !target_dir.exists() && std::fs::symlink_metadata(&target_dir).is_err() { + return Ok(()); // Nothing to delete. } - // Try to open and close the repository first to release any locks - // This helps on macOS where git files might be locked + // A symlink into the cache: unlink it, leave the cache entry. + if let Ok(meta) = std::fs::symlink_metadata(&target_dir) { + if meta.file_type().is_symlink() { + return std::fs::remove_file(&target_dir).map_err(|e| { + Error::Config(format!("Failed to unlink dataset {}: {}", short_name, e)) + }); + } + } + + // A real directory (a pre-cache clone): remove it. 
if let Ok(repo) = Repository::open(&target_dir) { - // Try to close the index explicitly if possible - // The index file is often the one that gets locked - let git_dir = repo.path(); + let git_dir = repo.path().to_path_buf(); let index_path = git_dir.join("index"); - - // Force close the repository to release file handles drop(repo); - - // Give it a moment for file handles to be released std::thread::sleep(std::time::Duration::from_millis(100)); - - // Try to remove the index file explicitly if it exists - // This often helps on macOS if index_path.exists() { let _ = std::fs::remove_file(&index_path); } } - // Use robust removal that handles macOS edge cases if let Err(e) = remove_dir_all_robust(&target_dir) { - // If robust removal fails, try using shell command as fallback - // This is often more reliable on macOS for stubborn directories let output = std::process::Command::new("rm") .arg("-rf") .arg(&target_dir) .output(); - match output { - Ok(result) if result.status.success() => { - // Successfully removed via shell command - Ok(()) - } + Ok(result) if result.status.success() => Ok(()), Ok(result) => { - // Shell command failed, return original error with shell error info let shell_err = String::from_utf8_lossy(&result.stderr); Err(Error::Config(format!( - "Failed to delete repository {}: {} (shell fallback also failed: {})", - repo_name, e, shell_err - ))) - } - Err(shell_err) => { - // Couldn't execute shell command, return original error - Err(Error::Config(format!( - "Failed to delete repository {}: {} (shell fallback unavailable: {})", - repo_name, e, shell_err + "Failed to delete dataset {}: {} (shell fallback also failed: {})", + short_name, e, shell_err ))) } + Err(shell_err) => Err(Error::Config(format!( + "Failed to delete dataset {}: {} (shell fallback unavailable: {})", + short_name, e, shell_err + ))), } } else { Ok(()) diff --git a/actions/govbot/src/init_from_frankie.rs b/actions/govbot/src/init_from_frankie.rs new file mode 100644 index 
00000000..77d8643b --- /dev/null +++ b/actions/govbot/src/init_from_frankie.rs @@ -0,0 +1,559 @@ +//! `govbot init --from-frankie-config` — migration tool that scaffolds a +//! govbot+fastclass project from a Frankie-style per-topic config. +//! +//! Frankie is the original CHN-Bluesky-Govbot framework. Each topic +//! (`transportation`, `immigration`, `housing`, …) lives in a +//! `topics//config.yml` carrying a name, display name, default emoji, +//! a flat keyword list covering subdomains, a keyword→emoji map, a digest +//! title, and a topic focus. This module reads one such file and emits a +//! govbot+fastclass project skeleton — a `govbot.yml` manifest plus a +//! fastclass classifier bundle plus the supporting stubs — so an existing +//! Frankie topic maintainer can migrate to the new stack without rebuilding +//! the keyword list from scratch. +//! +//! ### Field-to-field mapping +//! +//! | Frankie field | Scaffolded output | +//! |------------------|--------------------------------------------------------| +//! | `name` | classifier tag name + bluesky `select: []` | +//! | `display_name` | tag description framing + README header | +//! | `default_emoji` | README header + summarizer prompt voice | +//! | `keywords` | `classifier/classifier.yml: tags..include_keywords` | +//! | `emoji_map` | classifier.yml comment listing the keyword→emoji map | +//! | `digest_title` | `publish.feed.title` + `publish.site.title` | +//! | `topic` | tag description + summarizer prompt subject | +//! +//! No network calls — purely a local-file transformation. +//! +//! ### Atomicity +//! +//! Refuses to overwrite if `/govbot.yml` already exists. Otherwise +//! scaffolds everything before reporting success: a failure mid-write leaves +//! a partial tree (the user can `rm -rf ` and retry), but the +//! pre-flight check is the primary guard against clobbering an existing +//! project. 
+ +use anyhow::{anyhow, Context, Result}; +use serde::Deserialize; +use std::collections::BTreeMap; +use std::fs; +use std::path::{Path, PathBuf}; + +/// The Frankie per-topic config shape. Permissive: extra fields are absorbed +/// into `extra` so a Frankie config that carries fields we don't use yet +/// (timezone, schedule, jurisdictions, …) still parses cleanly. +#[derive(Debug, Deserialize)] +pub struct FrankieTopicConfig { + /// The machine-readable topic name (e.g. `"transportation"`). Becomes the + /// classifier's single tag name and the bluesky publisher's `select` entry. + pub name: String, + + /// Optional human-readable display name (e.g. `"Transportation"`). + /// Defaults to a title-cased `name` if absent. + pub display_name: Option<String>, + + /// Default emoji for the topic (e.g. `"🚗"`). + pub default_emoji: Option<String>, + + /// The flat keyword list covering the topic's subdomains. Becomes the + /// classifier tag's `include_keywords`. + #[serde(default)] + pub keywords: Vec<String>, + + /// Keyword→emoji map (e.g. `rail` → `"🚆"`). Surfaced in the classifier + /// bundle as a comment so the migrating maintainer can fold it back into a + /// post template later. + #[serde(default)] + pub emoji_map: BTreeMap<String, String>, + + /// The Frankie digest title (e.g. `"🗳️ Transportation Bills Weekly + /// Digest"`). Becomes the RSS/HTML publisher title. + pub digest_title: Option<String>, + + /// The Frankie "topic focus" string — a short framing the summarizer + /// uses (e.g. `"transportation"`). + pub topic: Option<String>, + + /// Catch-all for fields Frankie carries that this migration tool does not + /// translate. Held so unknown fields don't fail the parse. + #[serde(flatten)] + pub extra: serde_yaml::Value, +} + +impl FrankieTopicConfig { + /// Parse a Frankie-style `topics//config.yml` from a path.
+ pub fn load(path: &Path) -> Result<Self> { + let contents = fs::read_to_string(path) + .with_context(|| format!("Failed to read Frankie config: {}", path.display()))?; + let parsed: Self = serde_yaml::from_str(&contents) + .with_context(|| format!("Failed to parse Frankie config: {}", path.display()))?; + if parsed.name.trim().is_empty() { + return Err(anyhow!( + "Frankie config {} has empty `name` — required to scaffold a classifier tag", + path.display() + )); + } + Ok(parsed) + } + + /// The human-readable display name. Falls back to title-casing `name` + /// (e.g. `transportation` → `Transportation`). + pub fn display(&self) -> String { + self.display_name.clone().unwrap_or_else(|| { + let mut chars = self.name.chars(); + match chars.next() { + None => String::new(), + Some(first) => first.to_uppercase().collect::<String>() + chars.as_str(), + } + }) + } + + /// The summarizer-prompt topic focus, defaulting to `name` as given (no case change is applied). + pub fn topic_focus(&self) -> String { + self.topic.clone().unwrap_or_else(|| self.name.clone()) + } +} + +/// Scaffold a govbot+fastclass project at `into` from a parsed Frankie config. +/// Returns the path the project was written to (`into` as passed, not canonicalized). +pub fn scaffold(config: &FrankieTopicConfig, into: &Path) -> Result<PathBuf> { + // Pre-flight guard: refuse to clobber an existing project. + let manifest_path = into.join("govbot.yml"); + if manifest_path.exists() { + return Err(anyhow!( + "{} already exists — refusing to overwrite an existing govbot project. \ + Remove it first or scaffold into a different directory with --into .", + manifest_path.display() + )); + } + + fs::create_dir_all(into) + .with_context(|| format!("Failed to create scaffold dir: {}", into.display()))?; + + // 1. govbot.yml manifest + fs::write(&manifest_path, render_govbot_yml(config)) + .with_context(|| format!("Failed to write {}", manifest_path.display()))?; + + // 2.
classifier bundle + let classifier_dir = into.join("classifier"); + fs::create_dir_all(&classifier_dir)?; + fs::write( + classifier_dir.join("classifier.yml"), + render_classifier_yml(config), + )?; + fs::write(classifier_dir.join("fusion.yml"), render_fusion_yml())?; + + let eval_dir = classifier_dir.join("eval"); + fs::create_dir_all(&eval_dir)?; + fs::write( + eval_dir.join("constitution.yml"), + render_constitution_yml(config), + )?; + fs::write(eval_dir.join("rolling.yml"), render_rolling_yml())?; + + // proposals dir — empty; the improvement loop populates it. + fs::create_dir_all(classifier_dir.join("proposals"))?; + // Keep the dir tracked even though it is empty today. + fs::write(classifier_dir.join("proposals").join(".gitkeep"), "")?; + + // 3. summarizer stub + let summarizer_dir = into.join("summarizer"); + fs::create_dir_all(&summarizer_dir)?; + fs::write( + summarizer_dir.join("prompt.md"), + render_summarizer_prompt(config), + )?; + + // 4. README + fs::write(into.join("README.md"), render_readme(config))?; + + // 5. .gitignore + fs::write(into.join(".gitignore"), render_gitignore())?; + + Ok(into.to_path_buf()) +} + +/// Run the full `--from-frankie-config [--into ]` flow: parse, +/// scaffold, and print activist-facing next-steps to stdout. +pub fn run(from_config: &Path, into: Option<&Path>) -> Result<()> { + let config = FrankieTopicConfig::load(from_config)?; + let cwd = std::env::current_dir()?; + let into_path: PathBuf = into.map(|p| p.to_path_buf()).unwrap_or(cwd); + let written = scaffold(&config, &into_path)?; + print_next_steps(&written, from_config); + Ok(()) +} + +fn print_next_steps(into: &Path, from_config: &Path) { + println!( + "✓ Scaffolded govbot+fastclass project at {}.", + into.display() + ); + println!(); + println!( + "This project was created from {}. 
The keyword list became", + from_config.display() + ); + println!("your starter classifier; everything else is yours to refine."); + println!(); + println!("Recommended next steps:"); + println!(); + println!(" 1. Install the Tier-2 semantic model so embedding matchers fire:"); + println!(" fastclass model fetch --bundle ./classifier"); + println!(); + println!(" 2. Replace the placeholder constitution items with real labeled examples:"); + println!(" /fastclass:seed-gold ./classifier"); + println!(); + println!(" 3. Try classifying:"); + println!(" govbot run --dry-run"); + println!(); + println!(" 4. Iterate quality via the improvement loop:"); + println!(" /fastclass:improve autonomous"); + println!(); + println!(" 5. Set Bluesky credentials (env-only — never in govbot.yml):"); + println!(" export BLUESKY_HANDLE=..."); + println!(" export BLUESKY_APP_PASSWORD=..."); +} + +// --------------------------------------------------------------------------- +// File renderers — each pure function takes the Frankie config and returns +// the file contents. Keeping rendering pure makes unit testing trivial. +// --------------------------------------------------------------------------- + +fn render_govbot_yml(config: &FrankieTopicConfig) -> String { + let display = config.display(); + let title = config + .digest_title + .clone() + .unwrap_or_else(|| format!("{} Bills Weekly Digest", display)); + + let mut yml = String::new(); + yml.push_str("# Govbot manifest — scaffolded from a Frankie-style topic config.\n"); + yml.push_str("# See README.md for the migration story. 
Tune the classifier bundle\n"); + yml.push_str("# (./classifier) with the fastclass improvement loop, not by hand.\n"); + yml.push_str("$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json\n\n"); + + yml.push_str("datasets:\n"); + yml.push_str(" - all\n\n"); + + yml.push_str("transforms:\n"); + yml.push_str(" classify:\n"); + yml.push_str(" command: [fastclass, classify, \"-\"]\n"); + yml.push_str(" reads: docs\n"); + yml.push_str(" writes: classification\n"); + yml.push_str(" classifier: ./classifier\n\n"); + + yml.push_str("publish:\n"); + yml.push_str(" feed:\n"); + yml.push_str(" type: rss\n"); + yml.push_str(&format!(" select: [{}]\n", config.name)); + yml.push_str(&format!(" title: {}\n", yaml_string(&title))); + yml.push_str(" base_url: \"https://example.org/your-deployment\"\n"); + yml.push_str(" output_dir: dist\n"); + yml.push_str(&format!(" output_file: {}-feed.xml\n\n", config.name)); + + yml.push_str(" site:\n"); + yml.push_str(" type: html\n"); + yml.push_str(&format!(" select: [{}]\n", config.name)); + yml.push_str(&format!(" title: {}\n", yaml_string(&title))); + yml.push_str(" base_url: \"https://example.org/your-deployment\"\n"); + yml.push_str(" output_dir: dist\n\n"); + + yml.push_str(" bluesky:\n"); + yml.push_str(" type: bluesky\n"); + yml.push_str(&format!(" select: [{}]\n", config.name)); + yml.push_str(" # Calibrated final_score threshold (0..1). 
0.55 is a sensible starting\n"); + yml.push_str(" # point per the climate-activist deployment; raise to cut false\n"); + yml.push_str(" # positives, lower to widen recall.\n"); + yml.push_str(" min_score: 0.55\n"); + yml.push_str(" base_url: \"https://example.org/your-deployment\"\n"); + yml.push_str(" post_template: \"{title}\\n\\n{tags} · {link}\"\n"); + yml.push_str(" # Credentials are env-only: BLUESKY_HANDLE / BLUESKY_APP_PASSWORD.\n"); + yml.push_str(" # Never put them in this file.\n\n"); + + yml.push_str("pipelines:\n"); + yml.push_str(" default:\n"); + yml.push_str(" - classify\n"); + yml.push_str(" - feed\n"); + yml.push_str(" - site\n"); + yml.push_str(" - bluesky\n"); + + yml +} + +fn render_classifier_yml(config: &FrankieTopicConfig) -> String { + let display = config.display(); + let topic_focus = config.topic_focus(); + + let mut yml = String::new(); + yml.push_str("# classifier.yml — the taxonomy for this govbot+fastclass project.\n"); + yml.push_str("#\n"); + yml.push_str("# Scaffolded from a Frankie-style topic config. The single tag below\n"); + yml.push_str("# carries the keyword list from that config verbatim; everything else\n"); + yml.push_str("# (exclude gates, regex, examples, HyDE queries, subjects) is yours\n"); + yml.push_str("# to grow via the /fastclass:improve loop. 
Never hand-tune by guessing;\n"); + yml.push_str("# every change should be proved against the frozen gold set in\n"); + yml.push_str("# eval/constitution.yml.\n"); + if !config.emoji_map.is_empty() { + yml.push_str("#\n"); + yml.push_str("# Frankie emoji_map (kept here for reference — fold into a post template\n"); + yml.push_str("# later if you want per-subdomain emoji in Bluesky posts):\n"); + for (keyword, emoji) in &config.emoji_map { + yml.push_str(&format!("# {} → {}\n", keyword, emoji)); + } + } + yml.push_str("tags:\n"); + yml.push_str(&format!(" {}:\n", config.name)); + yml.push_str(&format!( + " description: >-\n Bills about {} — scaffolded from the Frankie topic\n config for \"{}\". Refine this description as the tag evolves.\n", + topic_focus, display + )); + yml.push_str(" include_keywords:\n"); + for keyword in &config.keywords { + yml.push_str(&format!(" - {}\n", yaml_string(keyword))); + } + yml.push_str(" # examples are intentionally empty — add real labeled bills via\n"); + yml.push_str(" # /fastclass:from-intent or by curating eval/constitution.yml.\n"); + yml.push_str(" examples: []\n"); + yml.push_str(" threshold: 0.3\n"); + + yml +} + +fn render_fusion_yml() -> String { + // Mirrors the climate-activist bundle's fusion.yml — the portable + // `models:` block declares the encoder + reranker so `fastclass model + // fetch --bundle ./classifier` resolves and installs them. + let mut yml = String::new(); + yml.push_str("# fusion.yml — global fusion config for the classifier bundle.\n"); + yml.push_str("# Owned by fastclass. Per-tag overrides live INLINE in classifier.yml.\n"); + yml.push_str("version: fusion-v1\n\n"); + yml.push_str( + "# Portable model declaration. 
Run `fastclass model fetch --bundle ./classifier`\n", + ); + yml.push_str("# to install these into the shared ~/.govbot/models// cache.\n"); + yml.push_str("models:\n"); + yml.push_str(" encoder: sentence-transformers/all-MiniLM-L6-v2\n"); + yml.push_str(" reranker: cross-encoder/ms-marco-MiniLM-L-6-v2\n\n"); + yml.push_str("# Default fusion weight per matcher kind, applied to any tag that does not\n"); + yml.push_str("# declare its own inline `fusion_weights`.\n"); + yml.push_str("weights:\n"); + yml.push_str(" keyword: 1.0\n"); + yml.push_str(" regex: 0.8\n\n"); + yml.push_str("# Cascade uncertainty band. Documents whose fused score lands inside\n"); + yml.push_str("# [low, high] are the uncertain ones the improvement loop focuses on.\n"); + yml.push_str("uncertainty_band:\n"); + yml.push_str(" low: 0.3\n"); + yml.push_str(" high: 0.7\n\n"); + yml.push_str("splitters:\n"); + yml.push_str(" default:\n"); + yml.push_str(" strategy: whole\n"); + yml.push_str(" sections:\n"); + yml.push_str(" strategy: sections\n"); + yml.push_str(" aggregation: max\n"); + yml +} + +fn render_constitution_yml(config: &FrankieTopicConfig) -> String { + // PLACEHOLDER items per the seed-gold pattern, clearly marked. The + // activist replaces them with real labeled bills via /fastclass:seed-gold. + let mut yml = String::new(); + yml.push_str("# constitution.yml — the FROZEN gold standard for this classifier.\n"); + yml.push_str("# Never shown to an LLM. 
The items below are PLACEHOLDERS — replace them\n"); + yml.push_str("# with real labeled bills (use /fastclass:seed-gold ./classifier) before\n"); + yml.push_str("# relying on the improvement loop's judgement.\n"); + yml.push_str("items:\n"); + yml.push_str(&format!(" - id: placeholder-{}-positive\n", config.name)); + yml.push_str(" text: >-\n"); + yml.push_str(&format!( + " PLACEHOLDER — replace with a real {} bill the classifier should\n tag (positive example).\n", + config.topic_focus() + )); + yml.push_str(&format!(" expected_tags: [{}]\n", config.name)); + yml.push_str(&format!(" - id: placeholder-{}-negative\n", config.name)); + yml.push_str(" text: >-\n"); + yml.push_str(&format!( + " PLACEHOLDER — replace with a real bill that should NOT be tagged\n {} (negative example used to gate false-positives).\n", + config.name + )); + yml.push_str(" expected_tags: []\n"); + yml +} + +fn render_rolling_yml() -> String { + let mut yml = String::new(); + yml.push_str("# rolling.yml — the refreshable working eval set.\n"); + yml.push_str("# The improvement loop adds failing bills here and proves fixes against\n"); + yml.push_str("# the (unseen) constitution. Empty today — start by labeling a handful\n"); + yml.push_str("# of bills from `govbot source --select docs` you disagree with.\n"); + yml.push_str("items: []\n"); + yml +} + +fn render_summarizer_prompt(config: &FrankieTopicConfig) -> String { + let display = config.display(); + let topic = config.topic_focus(); + let mut s = String::new(); + s.push_str(&format!("# {} summarizer prompt (stub)\n\n", display)); + s.push_str(&format!( + "Describe this bill in one neutral sentence, focused on {} policy.\n", + topic + )); + s.push_str( + "Avoid editorial language; let the bill text speak for itself. 
\ + A future `summarize` transform will read this prompt — today it is a\n\ + placeholder for the migrating maintainer to refine.\n", + ); + s +} + +fn render_readme(config: &FrankieTopicConfig) -> String { + let display = config.display(); + let emoji = config.default_emoji.as_deref().unwrap_or(""); + let topic = config.topic_focus(); + + let mut s = String::new(); + s.push_str(&format!("# {} {} govbot deployment\n\n", emoji, display)); + s.push_str( + "This is a govbot+fastclass project scaffolded **from a Frankie-style\n\ + topic config**. The Frankie keyword list became your starter classifier;\n\ + everything else is yours to refine.\n\n", + ); + s.push_str("## What was generated\n\n"); + s.push_str("- `govbot.yml` — the project manifest (datasets, transforms, publishers).\n"); + s.push_str(&format!( + "- `classifier/` — a fastclass bundle with one tag (`{}`) carrying the\n Frankie keyword list as `include_keywords`.\n", + config.name + )); + s.push_str("- `classifier/eval/constitution.yml` — **placeholder** gold items;\n"); + s.push_str(" replace before relying on the improvement loop's judgement.\n"); + s.push_str("- `summarizer/prompt.md` — stub for the future `summarize` transform.\n\n"); + s.push_str("## What to do next\n\n"); + s.push_str("1. Install the Tier-2 semantic model:\n"); + s.push_str(" ```\n fastclass model fetch --bundle ./classifier\n ```\n"); + s.push_str("2. Replace the placeholder constitution items with real labeled examples:\n"); + s.push_str(" ```\n /fastclass:seed-gold ./classifier\n ```\n"); + s.push_str("3. Try classifying:\n"); + s.push_str(" ```\n govbot run --dry-run\n ```\n"); + s.push_str("4. Iterate quality via the improvement loop:\n"); + s.push_str(" ```\n /fastclass:improve autonomous\n ```\n"); + s.push_str("5. 
Set Bluesky credentials (env-only — never in `govbot.yml`):\n"); + s.push_str( + " ```\n export BLUESKY_HANDLE=...\n export BLUESKY_APP_PASSWORD=...\n ```\n\n", + ); + s.push_str(&format!( + "## Topic focus\n\n`{}` — used by the summarizer prompt and the tag\ndescription. Adjust as your editorial scope sharpens.\n", + topic + )); + s +} + +fn render_gitignore() -> String { + "# govbot — generated, reconstructed on every run\n\ + .govbot/\n\ + dist/\n\ + docs/\n\ + # Classification output from `govbot apply` — regenerated each run.\n\ + tags/\n\ + # Publisher state — append-only ledgers.\n\ + state/\n\ + # fastclass / govbot lockfiles\n\ + fastclass.lock\n\ + govbot.lock\n\ + # Bundled model artifacts (resolved by `fastclass model fetch`).\n\ + classifier/model/\n\ + classifier/model-rerank/\n\ + \n\ + # Secrets — never commit\n\ + .env\n" + .to_string() +} + +/// Quote a YAML scalar conservatively — escapes any embedded `"` and wraps in +/// double quotes. Used for keyword lines and titles, where the source can +/// carry characters that would otherwise confuse the YAML parser. +fn yaml_string(s: &str) -> String { + let escaped = s.replace('\\', "\\\\").replace('"', "\\\""); + format!("\"{}\"", escaped) +} + +// --------------------------------------------------------------------------- +// Unit tests +// --------------------------------------------------------------------------- + +#[cfg(test)] +mod tests { + use super::*; + + /// The parser must accept a minimal Frankie config and tolerate extra + /// fields the migration tool does not translate. 
+ #[test] + fn frankie_config_parser_handles_minimal_config_with_extras() { + let yml = r#" +name: housing +display_name: Housing +default_emoji: 🏠 +keywords: + - affordable housing + - rent control + - eviction +emoji_map: + rent: 💵 + eviction: 🚪 +digest_title: "🏠 Housing Bills Weekly Digest" +topic: "housing policy" +# Extras Frankie carries that we don't translate yet: +schedule: weekly +timezone: America/Chicago +jurisdictions: + - il + - ca +"#; + let dir = tempfile::tempdir().unwrap(); + let path = dir.path().join("config.yml"); + std::fs::write(&path, yml).unwrap(); + + let parsed = FrankieTopicConfig::load(&path).expect("minimal Frankie config should parse"); + assert_eq!(parsed.name, "housing"); + assert_eq!(parsed.display(), "Housing"); + assert_eq!(parsed.default_emoji.as_deref(), Some("🏠")); + assert_eq!(parsed.keywords.len(), 3); + assert_eq!(parsed.emoji_map.get("rent").map(String::as_str), Some("💵")); + assert_eq!(parsed.topic_focus(), "housing policy"); + // Extra fields are absorbed, not rejected. + match parsed.extra { + serde_yaml::Value::Mapping(m) => { + assert!(m.contains_key(serde_yaml::Value::String("schedule".to_string()))); + assert!(m.contains_key(serde_yaml::Value::String("jurisdictions".to_string()))); + } + other => panic!("expected extras to land in a mapping, got: {:?}", other), + } + } + + /// Display falls back to title-casing `name` when `display_name` is absent. + #[test] + fn display_falls_back_to_title_case() { + let cfg = FrankieTopicConfig { + name: "transportation".to_string(), + display_name: None, + default_emoji: None, + keywords: vec![], + emoji_map: BTreeMap::new(), + digest_title: None, + topic: None, + extra: serde_yaml::Value::Null, + }; + assert_eq!(cfg.display(), "Transportation"); + } + + /// Empty `name` is rejected — the classifier needs a tag name. 
+ #[test] + fn frankie_config_rejects_empty_name() { + let dir = tempfile::tempdir().unwrap(); + let path = dir.path().join("config.yml"); + std::fs::write(&path, "name: \"\"\nkeywords: []\n").unwrap(); + + let err = FrankieTopicConfig::load(&path).expect_err("empty name must be rejected"); + assert!(err.to_string().contains("empty `name`")); + } +} diff --git a/actions/govbot/src/lib.rs b/actions/govbot/src/lib.rs index e2cd937d..cf25daf0 100644 --- a/actions/govbot/src/lib.rs +++ b/actions/govbot/src/lib.rs @@ -3,37 +3,44 @@ //! This library provides a reactive stream-based API for discovering, filtering, //! sorting, and processing JSON log files from pipeline repositories. +pub mod bluesky; +pub mod cache; pub mod config; -pub mod embeddings; pub mod error; pub mod filter; pub mod git; -pub mod locale_generated; +pub mod init_from_frankie; +pub mod lock; pub mod pipeline; pub mod processor; pub mod publish; +pub mod registry; pub mod rss; pub mod selectors; +pub mod tagfile; pub mod types; pub mod wizard; -pub use config::{Config, ConfigBuilder, JoinOption, SortOrder}; -pub use embeddings::{ - hash_text, BillTagResult, ScoreBreakdown, TagDefinition, TagFile, TagFileMetadata, TagMatcher, +pub use config::{ + Command_, Config, ConfigBuilder, JoinOption, Manifest, Publisher, PublisherKind, SortOrder, + Transform, }; pub use error::{Error, Result}; pub use filter::{FilterAlias, FilterManager, FilterResult, LogFilter}; -pub use locale::WorkingLocale; -pub use locale_generated as locale; +pub use lock::LockFile; pub use processor::PipelineProcessor; +pub use registry::{DatasetEntry, Registry, ResolvedDataset}; +pub use tagfile::{ + hash_text, BillTagResult, ScoreBreakdown, TagDefinition, TagFile, TagFileMetadata, +}; pub use types::{LogContent, LogEntry, Metadata, VoteEventResult}; /// Re-export commonly used types for convenience pub mod prelude { pub use crate::config::{Config, ConfigBuilder, JoinOption, SortOrder}; pub use crate::error::{Error, Result}; - pub use 
crate::locale::WorkingLocale; pub use crate::processor::PipelineProcessor; + pub use crate::registry::Registry; pub use crate::types::{LogContent, LogEntry, Metadata, VoteEventResult}; pub use futures::StreamExt; } diff --git a/actions/govbot/src/locale_generated.rs b/actions/govbot/src/locale_generated.rs deleted file mode 100644 index 6ca3b4b5..00000000 --- a/actions/govbot/src/locale_generated.rs +++ /dev/null @@ -1,302 +0,0 @@ -//! Auto-generated locale enum from pipeline-manager config.yml -//! This file is generated by src/bin/generate-locale-enum.rs -//! Do not edit manually - regenerate using: just generate - -/// Locale codes for working pipelines -#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, serde::Serialize, serde::Deserialize)] -#[serde(rename_all = "lowercase")] -pub enum WorkingLocale { - All, - AK, - AL, - AR, - CA, - CO, - DE, - FL, - GA, - GU, - HI, - IA, - ID, - IL, - IN, - KS, - KY, - LA, - MA, - MD, - ME, - MI, - MN, - MO, - MP, - MS, - MT, - NC, - ND, - NE, - NH, - NJ, - NM, - NV, - NY, - OH, - OK, - OR, - PA, - PR, - RI, - SC, - SD, - TN, - Usa, - UT, - VI, - VT, - WA, - WI, - WV, - WY, -} - -impl WorkingLocale { - /// Get all working locales as a vector (excludes All variant) - pub fn all() -> Vec<WorkingLocale> { - vec![ - WorkingLocale::AK, - WorkingLocale::AL, - WorkingLocale::AR, - WorkingLocale::CA, - WorkingLocale::CO, - WorkingLocale::DE, - WorkingLocale::FL, - WorkingLocale::GA, - WorkingLocale::GU, - WorkingLocale::HI, - WorkingLocale::IA, - WorkingLocale::ID, - WorkingLocale::IL, - WorkingLocale::IN, - WorkingLocale::KS, - WorkingLocale::KY, - WorkingLocale::LA, - WorkingLocale::MA, - WorkingLocale::MD, - WorkingLocale::ME, - WorkingLocale::MI, - WorkingLocale::MN, - WorkingLocale::MO, - WorkingLocale::MP, - WorkingLocale::MS, - WorkingLocale::MT, - WorkingLocale::NC, - WorkingLocale::ND, - WorkingLocale::NE, - WorkingLocale::NH, - WorkingLocale::NJ, - WorkingLocale::NM, - WorkingLocale::NV, - WorkingLocale::NY, - WorkingLocale::OH, -
WorkingLocale::OK, - WorkingLocale::OR, - WorkingLocale::PA, - WorkingLocale::PR, - WorkingLocale::RI, - WorkingLocale::SC, - WorkingLocale::SD, - WorkingLocale::TN, - WorkingLocale::Usa, - WorkingLocale::UT, - WorkingLocale::VI, - WorkingLocale::VT, - WorkingLocale::WA, - WorkingLocale::WI, - WorkingLocale::WV, - WorkingLocale::WY, - ] - } - - /// Get the locale code as a string - pub fn as_str(&self) -> &'static str { - match self { - WorkingLocale::All => "all", - WorkingLocale::AK => "ak", - WorkingLocale::AL => "al", - WorkingLocale::AR => "ar", - WorkingLocale::CA => "ca", - WorkingLocale::CO => "co", - WorkingLocale::DE => "de", - WorkingLocale::FL => "fl", - WorkingLocale::GA => "ga", - WorkingLocale::GU => "gu", - WorkingLocale::HI => "hi", - WorkingLocale::IA => "ia", - WorkingLocale::ID => "id", - WorkingLocale::IL => "il", - WorkingLocale::IN => "in", - WorkingLocale::KS => "ks", - WorkingLocale::KY => "ky", - WorkingLocale::LA => "la", - WorkingLocale::MA => "ma", - WorkingLocale::MD => "md", - WorkingLocale::ME => "me", - WorkingLocale::MI => "mi", - WorkingLocale::MN => "mn", - WorkingLocale::MO => "mo", - WorkingLocale::MP => "mp", - WorkingLocale::MS => "ms", - WorkingLocale::MT => "mt", - WorkingLocale::NC => "nc", - WorkingLocale::ND => "nd", - WorkingLocale::NE => "ne", - WorkingLocale::NH => "nh", - WorkingLocale::NJ => "nj", - WorkingLocale::NM => "nm", - WorkingLocale::NV => "nv", - WorkingLocale::NY => "ny", - WorkingLocale::OH => "oh", - WorkingLocale::OK => "ok", - WorkingLocale::OR => "or", - WorkingLocale::PA => "pa", - WorkingLocale::PR => "pr", - WorkingLocale::RI => "ri", - WorkingLocale::SC => "sc", - WorkingLocale::SD => "sd", - WorkingLocale::TN => "tn", - WorkingLocale::Usa => "usa", - WorkingLocale::UT => "ut", - WorkingLocale::VI => "vi", - WorkingLocale::VT => "vt", - WorkingLocale::WA => "wa", - WorkingLocale::WI => "wi", - WorkingLocale::WV => "wv", - WorkingLocale::WY => "wy", - } - } - - /// Get the locale code in lowercase 
- pub fn as_lowercase(&self) -> &'static str { - match self { - WorkingLocale::All => "all", - WorkingLocale::AK => "ak", - WorkingLocale::AL => "al", - WorkingLocale::AR => "ar", - WorkingLocale::CA => "ca", - WorkingLocale::CO => "co", - WorkingLocale::DE => "de", - WorkingLocale::FL => "fl", - WorkingLocale::GA => "ga", - WorkingLocale::GU => "gu", - WorkingLocale::HI => "hi", - WorkingLocale::IA => "ia", - WorkingLocale::ID => "id", - WorkingLocale::IL => "il", - WorkingLocale::IN => "in", - WorkingLocale::KS => "ks", - WorkingLocale::KY => "ky", - WorkingLocale::LA => "la", - WorkingLocale::MA => "ma", - WorkingLocale::MD => "md", - WorkingLocale::ME => "me", - WorkingLocale::MI => "mi", - WorkingLocale::MN => "mn", - WorkingLocale::MO => "mo", - WorkingLocale::MP => "mp", - WorkingLocale::MS => "ms", - WorkingLocale::MT => "mt", - WorkingLocale::NC => "nc", - WorkingLocale::ND => "nd", - WorkingLocale::NE => "ne", - WorkingLocale::NH => "nh", - WorkingLocale::NJ => "nj", - WorkingLocale::NM => "nm", - WorkingLocale::NV => "nv", - WorkingLocale::NY => "ny", - WorkingLocale::OH => "oh", - WorkingLocale::OK => "ok", - WorkingLocale::OR => "or", - WorkingLocale::PA => "pa", - WorkingLocale::PR => "pr", - WorkingLocale::RI => "ri", - WorkingLocale::SC => "sc", - WorkingLocale::SD => "sd", - WorkingLocale::TN => "tn", - WorkingLocale::Usa => "usa", - WorkingLocale::UT => "ut", - WorkingLocale::VI => "vi", - WorkingLocale::VT => "vt", - WorkingLocale::WA => "wa", - WorkingLocale::WI => "wi", - WorkingLocale::WV => "wv", - WorkingLocale::WY => "wy", - } - } -} - -impl From<&str> for WorkingLocale { - fn from(s: &str) -> Self { - match s.to_lowercase().as_str() { - "all" => WorkingLocale::All, - "ak" => WorkingLocale::AK, - "al" => WorkingLocale::AL, - "ar" => WorkingLocale::AR, - "ca" => WorkingLocale::CA, - "co" => WorkingLocale::CO, - "de" => WorkingLocale::DE, - "fl" => WorkingLocale::FL, - "ga" => WorkingLocale::GA, - "gu" => WorkingLocale::GU, - "hi" => 
WorkingLocale::HI, - "ia" => WorkingLocale::IA, - "id" => WorkingLocale::ID, - "il" => WorkingLocale::IL, - "in" => WorkingLocale::IN, - "ks" => WorkingLocale::KS, - "ky" => WorkingLocale::KY, - "la" => WorkingLocale::LA, - "ma" => WorkingLocale::MA, - "md" => WorkingLocale::MD, - "me" => WorkingLocale::ME, - "mi" => WorkingLocale::MI, - "mn" => WorkingLocale::MN, - "mo" => WorkingLocale::MO, - "mp" => WorkingLocale::MP, - "ms" => WorkingLocale::MS, - "mt" => WorkingLocale::MT, - "nc" => WorkingLocale::NC, - "nd" => WorkingLocale::ND, - "ne" => WorkingLocale::NE, - "nh" => WorkingLocale::NH, - "nj" => WorkingLocale::NJ, - "nm" => WorkingLocale::NM, - "nv" => WorkingLocale::NV, - "ny" => WorkingLocale::NY, - "oh" => WorkingLocale::OH, - "ok" => WorkingLocale::OK, - "or" => WorkingLocale::OR, - "pa" => WorkingLocale::PA, - "pr" => WorkingLocale::PR, - "ri" => WorkingLocale::RI, - "sc" => WorkingLocale::SC, - "sd" => WorkingLocale::SD, - "tn" => WorkingLocale::TN, - "usa" => WorkingLocale::Usa, - "ut" => WorkingLocale::UT, - "vi" => WorkingLocale::VI, - "vt" => WorkingLocale::VT, - "wa" => WorkingLocale::WA, - "wi" => WorkingLocale::WI, - "wv" => WorkingLocale::WV, - "wy" => WorkingLocale::WY, - _ => panic!("Invalid working locale: {}", s), - } - } -} - -impl std::fmt::Display for WorkingLocale { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - write!(f, "{}", self.as_lowercase()) - } -} diff --git a/actions/govbot/src/lock.rs b/actions/govbot/src/lock.rs new file mode 100644 index 00000000..b01d2813 --- /dev/null +++ b/actions/govbot/src/lock.rs @@ -0,0 +1,173 @@ +//! `govbot.lock` — the dataset lockfile, for reproducible runs. +//! +//! `govbot.yml` declares *which* datasets a project wants; `govbot.lock` +//! records the *exact git commit* each resolved to, so a run on another +//! machine (or a re-run weeks later) processes byte-identical data. It is the +//! `package-lock.json` / `Cargo.lock` of govbot. +//! +//! ## When it is written +//! 
+//! `govbot pull` and `govbot run` write/update `govbot.lock` next to
+//! `govbot.yml` after resolving and fetching datasets — recording each
+//! dataset's canonical id, git URL, channel, the cloned commit SHA, the
+//! content-addressed cache key, and the resolve timestamp.
+//!
+//! ## Format
+//!
+//! `govbot.lock` is JSON (stable, diff-friendly, no YAML ambiguity):
+//!
+//! ```json
+//! {
+//!   "lockfile_version": 1,
+//!   "generated_at": "2026-05-22T12:00:00Z",
+//!   "datasets": {
+//!     "us-legislation/wy": {
+//!       "git_url": "https://github.com/chn-openstates-files/wy-legislation.git",
+//!       "channel": null,
+//!       "commit": "a1b2c3d4e5f6...",
+//!       "cache_key": "wy-legislation-3f9a1c20e5b4",
+//!       "resolved_at": "2026-05-22T12:00:00Z"
+//!     }
+//!   }
+//! }
+//! ```
+//!
+//! Keys are canonical `namespace/name` ids; the map is sorted for a stable
+//! diff. The lockfile SHOULD be committed to a project's git repo.
+
+use crate::error::{Error, Result};
+use serde::{Deserialize, Serialize};
+use std::collections::BTreeMap;
+use std::path::{Path, PathBuf};
+
+/// The current lockfile format version.
+pub const LOCKFILE_VERSION: u32 = 1;
+
+/// The lockfile filename, written next to `govbot.yml`.
+pub const LOCKFILE_NAME: &str = "govbot.lock";
+
+/// One pinned dataset.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct LockedDataset {
+    /// The git URL the dataset was cloned from.
+    pub git_url: String,
+    /// The requested channel (branch), if any.
+    pub channel: Option<String>,
+    /// The exact commit SHA the dataset is pinned to.
+    pub commit: String,
+    /// The shared-cache key the dataset's clone lives under.
+    pub cache_key: String,
+    /// When this dataset was last resolved (RFC 3339 UTC).
+    pub resolved_at: String,
+}
+
+/// The whole `govbot.lock` file.
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct LockFile {
+    /// Lockfile format version.
+    pub lockfile_version: u32,
+    /// When the lockfile was last written (RFC 3339 UTC).
+    pub generated_at: String,
+    /// Canonical `namespace/name` → pin. Sorted for a stable diff.
+    pub datasets: BTreeMap<String, LockedDataset>,
+}
+
+impl Default for LockFile {
+    fn default() -> Self {
+        LockFile {
+            lockfile_version: LOCKFILE_VERSION,
+            generated_at: now_rfc3339(),
+            datasets: BTreeMap::new(),
+        }
+    }
+}
+
+impl LockFile {
+    /// The lockfile path for a project (the directory holding `govbot.yml`).
+    pub fn path_for(project_dir: &Path) -> PathBuf {
+        project_dir.join(LOCKFILE_NAME)
+    }
+
+    /// Load an existing lockfile, or an empty one if none exists yet.
+    pub fn load_or_default(project_dir: &Path) -> Result<LockFile> {
+        let path = LockFile::path_for(project_dir);
+        if !path.is_file() {
+            return Ok(LockFile::default());
+        }
+        let contents = std::fs::read_to_string(&path)
+            .map_err(|e| Error::Config(format!("Failed to read {}: {}", path.display(), e)))?;
+        serde_json::from_str(&contents)
+            .map_err(|e| Error::Config(format!("Invalid {}: {}", path.display(), e)))
+    }
+
+    /// Record (or overwrite) a dataset's pin.
+    pub fn pin(
+        &mut self,
+        canonical_id: &str,
+        git_url: &str,
+        channel: Option<&str>,
+        commit: &str,
+        cache_key: &str,
+    ) {
+        self.datasets.insert(
+            canonical_id.to_string(),
+            LockedDataset {
+                git_url: git_url.to_string(),
+                channel: channel.map(|c| c.to_string()),
+                commit: commit.to_string(),
+                cache_key: cache_key.to_string(),
+                resolved_at: now_rfc3339(),
+            },
+        );
+    }
+
+    /// Write the lockfile to `/govbot.lock`, pretty-printed,
+    /// refreshing `generated_at`.
+ pub fn save(&mut self, project_dir: &Path) -> Result<()> { + self.lockfile_version = LOCKFILE_VERSION; + self.generated_at = now_rfc3339(); + let path = LockFile::path_for(project_dir); + let json = serde_json::to_string_pretty(self) + .map_err(|e| Error::Config(format!("Failed to serialize lockfile: {}", e)))?; + std::fs::write(&path, format!("{}\n", json)) + .map_err(|e| Error::Config(format!("Failed to write {}: {}", path.display(), e)))?; + Ok(()) + } +} + +/// The current time as an RFC 3339 UTC string. +fn now_rfc3339() -> String { + chrono::Utc::now().to_rfc3339_opts(chrono::SecondsFormat::Secs, true) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn round_trips_through_disk() { + let dir = tempfile::tempdir().unwrap(); + let mut lock = LockFile::default(); + lock.pin( + "us-legislation/wy", + "https://example.com/wy.git", + None, + "abc123", + "wy-legislation-deadbeef", + ); + lock.save(dir.path()).unwrap(); + + let reloaded = LockFile::load_or_default(dir.path()).unwrap(); + assert_eq!(reloaded.lockfile_version, LOCKFILE_VERSION); + let wy = reloaded.datasets.get("us-legislation/wy").unwrap(); + assert_eq!(wy.commit, "abc123"); + assert_eq!(wy.cache_key, "wy-legislation-deadbeef"); + } + + #[test] + fn missing_lockfile_is_empty_default() { + let dir = tempfile::tempdir().unwrap(); + let lock = LockFile::load_or_default(dir.path()).unwrap(); + assert!(lock.datasets.is_empty()); + } +} diff --git a/actions/govbot/src/main.rs b/actions/govbot/src/main.rs index 0b59ab15..647ebbc1 100644 --- a/actions/govbot/src/main.rs +++ b/actions/govbot/src/main.rs @@ -1,18 +1,64 @@ +//! # govbot — a 4-tool civic-data publishing stack +//! +//! govbot exists so a small activist crew can run a credible legislative +//! news bot at **nearly-free** cost on commodity infrastructure (GitHub +//! Actions + a laptop with local models). The first user is the +//! `climate-activist` userland repo; the success bar is "Bluesky posts +//! 
worth reading, nearly free to run/improve". +//! +//! The stack is four composable tools: +//! +//! 1. **Select real gov data** — `govbot pull` clones the legislation of +//! all 50 states, DC, the territories, and federal Congress from a +//! content-addressed registry of git repos (scrapers thanks to +//! OpenStates). Today `govbot source --select docs` projects bill +//! text + subjects; sponsors and voting records are captured in the +//! underlying metadata but not yet in the docs projection. +//! 2. **Filter / transform** — fastclass tagging is the shipped +//! transform: a low-token, high-quality classifier the activist +//! tunes against their own issue taxonomy, piped over the stream +//! protocol (see `schemas/STREAM_PROTOCOL.md`). The planned +//! `summarize` transform — a local-LLM digest of grouped bills +//! emitted with a deterministic trace (model id + source bill ids + +//! prompt revision) — is not yet built. +//! 3. **Publish with receipts** — RSS, HTML, JSON, DuckDB, and a +//! Bluesky posting bot ship today. The defining roadmap idea is the +//! **receipt**: a GitHub Pages artifact that carries the +//! deterministic provenance behind every AI digest (model used, +//! source bills, fastclass reasoning, regen command) so the short +//! Bluesky post can link to a trustworthy long form. The AI digest +//! publisher and the receipt artifact are not yet built; the X +//! publisher is not yet built. +//! 4. **Coding-agent-native dev experience** — `AGENT.md` is a self- +//! contained playbook a fresh Claude Code session can follow to +//! make / manage / update a govbot project. The fastclass plugin +//! (`/fastclass:from-intent`, `/fastclass:improve`, +//! `/fastclass:ratify`, `/fastclass:install-model`) handles the +//! classifier loop; `govbot doctor` validates installations. The +//! "build your own govbot" path is the one tool already shipping +//! its vision. +//! +//! This binary is the gov-data CLI piece of the stack. It owns dataset +//! 
pull/cache/lock, the stream-protocol `source` and `apply` stages, +//! the manifest-driven `run` orchestrator, and the publisher set above. +//! Classification is intentionally a separate binary (`fastclass`) so +//! the activist can tune the taxonomy without touching this code. + use clap::{Parser, Subcommand}; -use govbot::git; -use govbot::{TagMatcher, hash_text, TagFile, TagFileMetadata, BillTagResult}; -use govbot::selectors::ocd_files_select_default; -use govbot::publish::{load_config, get_repos_from_config, filter_by_tags, deduplicate_entries, sort_by_timestamp}; -use govbot::rss; -use futures::StreamExt; use futures::stream; -use std::io::{self, Write, BufRead, BufReader}; -use std::path::PathBuf; -use serde_json; +use futures::StreamExt; +use govbot::git; +use govbot::lock::LockFile; +use govbot::publish::{deduplicate_entries, filter_by_tags, load_manifest, sort_by_timestamp}; +use govbot::registry::Registry; +use govbot::selectors::{ocd_files_extract_subjects, ocd_files_select_default}; +use govbot::{hash_text, BillTagResult, TagFile, TagFileMetadata}; use jwalk::WalkDir; +use std::collections::HashMap; use std::fs; +use std::io::{self, BufRead, BufReader, Write}; +use std::path::{Path, PathBuf}; use std::process::Command as ProcessCommand; -use std::collections::HashMap; /// Write a line to stdout, gracefully handling broken pipe errors /// This is essential for piping to tools like yq, jq, etc. @@ -40,18 +86,34 @@ fn write_json_line(line: &str) -> io::Result<()> { #[derive(Debug, Clone)] struct CloneResult { locale: String, - result: String, // "cloned", "pulled", "no_updates", "failed" + result: String, // emoji, or "failed" position: String, // "1/37" size: Option, local_size: Option, final_size: Option, error: Option, + /// On success: the canonical registry id, git URL, channel, resolved + /// commit SHA, and cache key — recorded into `govbot.lock`. 
+ pin: Option, +} + +/// A resolved dataset pin, captured during a successful clone/pull for the +/// lockfile. +#[derive(Debug, Clone)] +struct DatasetPin { + canonical_id: String, + git_url: String, + channel: Option, + commit: String, + cache_key: String, } -/// Type-safe, functional reactive processor for pipeline log files +/// govbot — gov-data package manager and transform/publish orchestrator. #[derive(Parser, Debug)] #[command(name = "govbot")] -#[command(about = "Process pipeline log files with type-safe reactive streams")] +#[command( + about = "govbot — a 4-tool civic-data publishing stack. (1) Select real gov data: pull the legislation of all 50 states, DC, territories, and federal Congress from a content-addressed dataset registry. (2) Filter/transform: run transforms over the stream — fastclass tagging today, local-LLM summarize on the roadmap. (3) Publish with receipts: RSS / HTML / JSON / DuckDB / Bluesky today, plus a roadmap GitHub Pages 'receipts' artifact that carries deterministic provenance behind every AI digest. (4) Coding-agent-native dev experience: AGENT.md walks Claude Code through make / manage / update of a project. Configured by a govbot.yml manifest (datasets / transforms / publish / pipelines). See AGENT.md for the end-user playbook, README for the honest gap map." +)] #[command(version)] struct Args { #[command(subcommand)] @@ -60,11 +122,12 @@ struct Args { #[derive(Subcommand, Debug)] enum Command { - /// Clone or pull data pipeline repositories (default: updates existing repos) - /// Clones if repository doesn't exist, pulls if it does - /// Use "govbot clone all" to clone all repos, or "govbot clone " for specific repos - Clone { - /// Repository names to clone/pull (e.g., usa, il, ca, or "all" for all repos). If not specified, updates existing repos. + /// Pull (clone or update) dataset repositories into the shared cache and + /// link them into the project. 
Use `govbot pull all` to pull every dataset, + /// `govbot pull ...` for specific ones, or `govbot pull` with no args + /// to refresh whatever's already linked into the project. + Pull { + /// Dataset identifiers to pull (e.g. `wy`, `il`, `us-legislation/ca`, or `all`). With no args, refreshes datasets already linked into the project. #[arg(num_args = 0..)] repos: Vec, @@ -84,17 +147,20 @@ enum Command { #[arg(long)] verbose: bool, - /// List available repos instead of cloning/pulling + /// List available datasets instead of pulling #[arg(long)] list: bool, }, - /// Process and display pipeline log files - Logs { - /// Repos to output (default: `all`) `--repos="il,ca"` - #[arg(long, num_args = 0..)] + /// Stream dataset records as JSON Lines — the govbot stream-protocol + /// `source` stage. Pipe into a transform (`fastclass classify -`) or into + /// `govbot apply` for the persistence sink. See `schemas/STREAM_PROTOCOL.md`. + Source { + /// Datasets to emit (default: every linked dataset). Accepts the same + /// identifiers as `govbot pull` (`wy`, `il`, `us-legislation/ca`). + #[arg(long = "datasets", visible_alias = "repos", num_args = 0..)] repos: Vec, - + /// Per repo limit (default: 100) options: `none` | number #[arg(long, default_value = "100")] limit: String, @@ -103,11 +169,26 @@ enum Command { #[arg(long, default_value = "bill,tags")] join: String, - /// Select/transform fields (default: `default`) - applies extract_text_from_json transformation - #[arg(long, default_value = "default", value_parser = ["default"])] + /// Select/transform fields (default: `default`). `docs` emits one + /// `{"id","text","kind":"docs"}` JSON object per entry carrying the + /// FULL bill text — the stream-protocol document `fastclass classify -` + /// consumes. 
+ #[arg(long, default_value = "default", value_parser = ["default", "docs"])] select: String, - /// Filter log entries based on per-repo AI generated filters (default: `default`) options: `default` | `none` + /// Per-repo log filter (default: `default`). Options: `default` | + /// `none`. `default` applies the per-dataset filter under + /// `src/filters//default.rs` — it drops *routine* log + /// actions (introductions, committee referrals, "Bill Number + /// Assigned", "Placed on General File", boilerplate "President + /// Signed" log lines, etc.) so the stream emits only **substantive** + /// events: passage votes, executive signatures, amendments, defeats. + /// `none` keeps every log entry. The default filter is action-based, + /// not date-based: a bill whose only logs are routine actions + /// (e.g. a freshly-filed bill with just an "Introduction" log) will + /// emit zero records under `--filter default` until a substantive + /// event lands. Use `--filter none` to confirm a bill is missing + /// because of the filter rather than a data problem. #[arg(long, default_value = "default", value_parser = ["default", "none"])] filter: String, @@ -117,13 +198,15 @@ enum Command { /// Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env var) #[arg(long = "govbot-dir")] - govbot_dir: Option, + govbot_dir: Option, }, - /// Delete data pipeline repositories - /// Deletes local repository directories for specified locales + /// Delete locally-linked dataset clones from the project's `.govbot/repos/`. + /// Use `govbot delete all` to clear every linked dataset, or + /// `govbot delete ...` for specific ones. The shared cache at + /// `~/.govbot/cache/` is not touched — a subsequent `pull` re-links instantly. Delete { - /// Locale names to delete (e.g., usa, il, ca, or "all" for all locales). Use "all" to delete all repositories. + /// Dataset identifiers to unlink (e.g. `wy`, `il`, `us-legislation/ca`, or `all`). 
#[arg(num_args = 0..)] locales: Vec, @@ -140,9 +223,11 @@ enum Command { verbose: bool, }, - /// Load bill metadata into a DuckDB database file - /// Loads all metadata.json files from cloned repos into a DuckDB database for analysis. - /// The database file is saved in the base govbot directory (e.g., ./.govbot/govbot.duckdb) + /// Load bill metadata into a DuckDB database for SQL analysis. Walks every + /// linked dataset's `metadata.json` files, creates a `bills` table + a + /// `bills_summary` view, and writes the database into the base govbot + /// directory (default `./.govbot/govbot.duckdb`). Requires the `duckdb` CLI + /// on PATH. Load { /// Output database filename (default: govbot.duckdb). Saved in the base govbot directory. #[arg(long, default_value = "govbot.duckdb")] @@ -161,57 +246,224 @@ enum Command { threads: Option, }, - /// Update govbot to the latest nightly version - /// Downloads and installs the latest nightly build from GitHub releases + /// Update the installed govbot binary to the latest nightly build from + /// GitHub releases. Installs into `~/.govbot/bin/govbot` and prefers the + /// platform-native `.tar.gz` asset. Update, - /// Build RSS feed and HTML index from govbot.yml configuration - /// Generates a combined RSS feed and HTML index from logs filtered by tags in govbot.yml - Build { - /// Specific tags to include in feed (default: all tags from govbot.yml) - #[arg(long, num_args = 0..)] - tags: Vec, - - /// Limit number of entries per feed (default: 100, use "none" for all entries) + /// Run one or more publishers from `govbot.yml: publish:`. A publisher + /// consumes the tagged result stream and emits artifacts: `rss`/`html`/`json` + /// write feed/index/dump files, `duckdb` loads records into a database, + /// `bluesky` posts matches to a Bluesky account (always dry-run first with + /// `--dry-run`). 
+ Publish { + /// Publisher name(s) from govbot.yml `publish:` (default: every publisher) + #[arg(long = "publisher", num_args = 0..)] + publishers: Vec, + + /// Limit number of entries per artifact (default: 100, use "none" for all entries) #[arg(long)] limit: Option, - - /// Output directory for RSS feed and HTML (default: from govbot.yml build.output_dir, or "docs") + + /// Output directory override (default: from the publisher's output_dir, or "docs") #[arg(long)] output_dir: Option, - - /// Output filename for RSS feed (default: from govbot.yml build.output_file, or "feed.xml") + + /// Output filename override (default: from the publisher's output_file, or "feed.xml") #[arg(long)] output_file: Option, - + + /// Render but do not emit. The `bluesky` publisher honours this by + /// printing the posts it would send and touching no network/ledger. + #[arg(long = "dry-run")] + dry_run: bool, + /// Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env var) #[arg(long = "govbot-dir")] govbot_dir: Option, }, - /// Tag bills using semantic or built-in similarity based on govbot.yml in the current directory. - /// Reads JSON lines from stdin (from `govbot logs`), processes entries with bill identifiers, - /// and writes per-tag files under the directory containing govbot.yml. - /// By default, acts as a filter: only outputs lines that match tags. - /// If a tag name is provided, only processes and outputs lines matching that specific tag. - Tag { - /// Optional tag name to filter to a specific tag (e.g., "lgbtq", "budget") + /// Persist fastclass classification results as tag files under the + /// project's `tags/` output directory. Reads `fastclass classify` result + /// JSON from stdin — the apply sink of + /// `govbot source --select docs | fastclass classify - | govbot apply` — + /// and writes per-tag `.tag.json` files under + /// `/tags//country:.../sessions//`, the files + /// `govbot publish` turns into feeds. 
Classification itself is done by + /// fastclass; `govbot apply` only stores the results. `tags/` is a + /// project-rooted classification-output dir — peer to `dist/` (publisher + /// output) and distinct from `.govbot/` (the tool's regenerable cache). + Apply { + /// Optional tag name: persist only this tag's matches tag_name: Option, - /// Output directory (defaults to the directory containing govbot.yml) + /// Output directory (default: `/tags/`). Overrides the + /// default routing entirely — the dataset short-name is dropped and + /// tag files land under `/country:.../sessions/.../tags/`. #[arg(long = "output-dir")] output_dir: Option, + /// Overwrite a bill's tag entry even if it is already present + #[arg(long)] + overwrite: bool, + }, + + /// Run the full pipeline against the current directory's `govbot.yml`: + /// pull/update datasets → `source --select docs | fastclass classify - | apply` + /// (the classify transform) → publish every configured publisher. + /// `govbot` with no arguments is equivalent (and falls back to `init` if no + /// `govbot.yml` is present). + Run { + /// Govbot directory (default: $CWD/.govbot, or GOVBOT_DIR env var) + #[arg(long = "govbot-dir")] + govbot_dir: Option, + + /// Render but do not emit. Propagates to every publisher — the + /// `bluesky` publisher honours this by printing the posts it would + /// send and touching no network/ledger. Recommended for first runs: + /// a missing-cred `bluesky` publisher already auto-skips with a + /// WARN, but `--dry-run` makes it explicit. + #[arg(long = "dry-run")] + dry_run: bool, + }, + + /// Scaffold a new govbot.yml in the current directory (the setup wizard). + /// Interactive in a TTY; writes sensible defaults when non-interactive. 
+ /// + /// `--from-frankie-config ` bypasses the wizard and scaffolds a + /// govbot+fastclass project skeleton from a Frankie-style + /// `topics//config.yml` — the migration tool for existing + /// CHN-Bluesky-Govbot topic maintainers moving to the new stack. + Init { + /// Path to a Frankie-style topics//config.yml. When set, govbot init + /// generates a govbot+fastclass project skeleton from the CHN-Bluesky-Govbot + /// framework's per-topic shape (keyword list + emoji map + summary focus) + /// instead of running the interactive wizard. + #[arg(long = "from-frankie-config")] + from_frankie_config: Option, + + /// Where to scaffold the project. Default: cwd. + #[arg(long = "into")] + into: Option, + }, + + /// Add one or more datasets to the project's `govbot.yml` `datasets:` list. + /// Each id is validated against the registry before it is added. + Add { + /// Dataset identifiers to add (e.g. `wy`, `il`, `us-legislation/ca`). + #[arg(num_args = 1..)] + datasets: Vec, + }, + + /// Remove one or more datasets from the project's `govbot.yml`. + Remove { + /// Dataset identifiers to remove from `datasets:`. + #[arg(num_args = 1..)] + datasets: Vec, + }, + + /// List datasets — the project's manifest datasets and the ones cached + /// locally. With no manifest, lists every dataset in the registry. + Ls { /// Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env var) #[arg(long = "govbot-dir")] govbot_dir: Option, - /// Force re-tagging even if bill already exists in tag files - #[arg(long)] - overwrite: bool, + /// Emit machine-readable JSON instead of a human table. + #[arg(long = "output", value_parser = ["text", "json"], default_value = "text")] + output: String, }, -} + /// Search the dataset registry. A blank query lists every dataset. + Search { + /// Query matched against dataset ids and names (case-insensitive). + #[arg(num_args = 0..)] + query: Vec, + + /// Emit machine-readable JSON instead of a human table. 
+ #[arg(long = "output", value_parser = ["text", "json"], default_value = "text")] + output: String, + }, + + /// Check that the project's pulled datasets are coherent. A data-integrity smoke test, runnable after `govbot pull all` or before `govbot run` in production. Walks every linked dataset and verifies that the `govbot source --select docs` stream is well-formed: every linked dataset entry resolves to a real directory, per-dataset ids don't collapse onto a handful (the bug-7592418 signature), every sampled `id` resolves to a present and parseable `metadata.json`, and every sampled `text` is non-trivial. Zero-record datasets are surfaced as warnings rather than errors — `--filter default` can legitimately drop every routine log. Exits non-zero on any failure so it can drop straight into a CI step. Skips cleanly when the cache is empty — this is a smoke test, not a unit test. + Doctor { + /// Govbot directory (default: $CWD/.govbot, or GOVBOT_DIR env var) + #[arg(long = "govbot-dir")] + govbot_dir: Option, + + /// Records to sample per dataset for the metadata.json and + /// text-length checks (default: 20). The id-distinctness and + /// coverage checks always cover every emitted record. + #[arg(long = "sample", default_value_t = 20)] + sample: usize, + + /// Per-dataset emit limit fed through to `govbot source --limit` + /// (default: 100, matching the source default — the smoke-test + /// sweet spot for a typical 55-state pull in <60s). Use "none" + /// for an exhaustive sweep at the cost of runtime. + #[arg(long = "limit", default_value = "100")] + limit: String, + + /// Emit a machine-readable JSON report instead of the human summary. + /// Suitable for piping into a CI step. 
+ #[arg(long = "output", value_parser = ["text", "json"], default_value = "text")] + output: String, + }, + + /// **Deprecated.** Alias for `govbot source` (default mode) preserved so + /// existing consumers (the CHN-Bluesky-Govbot-Main framework, anyone + /// running `govbot logs > bills.jsonl`) keep working after the + /// Logs→Source rename. Prints a deprecation warning to stderr on + /// invocation. Will be removed in a future major version. + /// + /// The flag surface mirrors `govbot source` exactly — every flag that + /// `Source` accepts is honored here and forwarded verbatim. Anything + /// Frankie's `govbot logs > bills.jsonl` invocation might pass keeps + /// working. + Logs { + /// Datasets to emit (default: every linked dataset). Mirrors + /// `govbot source --datasets/--repos`. + #[arg(long = "datasets", visible_alias = "repos", num_args = 0..)] + repos: Vec, + + /// Per repo limit (default: 100) options: `none` | number. Mirrors + /// `govbot source --limit`. + #[arg(long, default_value = "100")] + limit: String, + + /// Join additional datasets (default: `bill,tags`). Mirrors + /// `govbot source --join`. + #[arg(long, default_value = "bill,tags")] + join: String, + + /// Select/transform fields (default: `default`). Mirrors + /// `govbot source --select`. Frankie's `govbot logs > bills.jsonl` + /// runs with the default — emitting the full joined record his + /// `scripts/post_to_bluesky.py` parses. + #[arg(long, default_value = "default", value_parser = ["default", "docs"])] + select: String, + + /// Per-repo log filter (default: `none` — every log entry, for + /// back-compat with the CHN-Bluesky-Govbot-Main framework's + /// `scripts/post_to_bluesky.py`, which was written against the + /// pre-Source-rename `govbot logs` output that did not filter). + /// Opt into the action-based filter (drops routine introductions, + /// committee referrals, "Bill Number Assigned" lines, etc.) with + /// `--filter default`. 
Same values as `govbot source --filter`, + /// only the default differs. + #[arg(long, default_value = "none", value_parser = ["default", "none"])] + filter: String, + + /// Sort order (default: DESC). Mirrors `govbot source --sort`. + #[arg(long, default_value = "DESC", value_parser = ["ASC", "DESC"])] + sort: String, + + /// Govbot directory (default: $CWD/.govbot/repos, or GOVBOT_DIR env + /// var). Mirrors `govbot source --govbot-dir`. + #[arg(long = "govbot-dir")] + govbot_dir: Option, + }, +} fn get_govbot_dir(govbot_dir: Option) -> anyhow::Result { // Check flag first, then environment variable, then default @@ -227,65 +479,92 @@ fn get_govbot_dir(govbot_dir: Option) -> anyhow::Result { } } -/// Process a single locale clone/pull operation -fn process_single_locale( - locale: &str, +/// The directory holding the project's `govbot.yml` (and where `govbot.lock` +/// is written) — the current working directory. +fn project_dir() -> anyhow::Result { + std::env::current_dir().map_err(|e| anyhow::anyhow!("Could not determine cwd: {}", e)) +} + +/// Load the active dataset registry for the current project. +fn load_registry() -> anyhow::Result { + let dir = project_dir()?; + Registry::load(&dir).map_err(|e| anyhow::anyhow!("{}", e)) +} + +/// Process a single dataset clone/pull operation. +/// +/// Resolution is registry-driven: the dataset is cloned once into the shared +/// `~/.govbot/cache/` and linked into the project's `repos/`. The resolved +/// commit SHA is captured for `govbot.lock`. 
+fn process_single_dataset( + dataset: &govbot::ResolvedDataset, repos_dir: &PathBuf, token_str: Option<&str>, verbose: bool, ) -> CloneResult { - let repo_name = git::build_repo_name(locale); - let target_dir = repos_dir.join(&repo_name); - + let short = dataset.short_name().to_string(); + let target_dir = repos_dir.join(git::repo_dir_name(&short)); + let local_size = if target_dir.exists() { git::get_directory_size(&target_dir).unwrap_or(0) } else { 0 }; - - match git::clone_or_pull_repo_quiet(locale, repos_dir, token_str, !verbose) { - Ok(action) => { + + match git::clone_or_pull_dataset(dataset, repos_dir, token_str, !verbose) { + Ok(outcome) => { let final_size = if target_dir.exists() { git::get_directory_size(&target_dir).unwrap_or(0) } else { 0 }; - - let result = match action { + + let result = match outcome.action { "clone" => "🆕", "pulled" => "⬇️", "no_updates" => "✅", "recloned" => "🔄", _ => "processed", }; - + let mut clone_result = CloneResult { - locale: locale.to_string(), + locale: short.clone(), result: result.to_string(), - position: String::new(), // Will be set by caller + position: String::new(), size: None, local_size: None, final_size: None, error: None, + pin: Some(DatasetPin { + canonical_id: dataset.id.clone(), + git_url: dataset.entry.git_url.clone(), + channel: dataset.channel.clone(), + commit: outcome.commit.clone(), + cache_key: outcome.cache_key.clone(), + }), }; - - if action == "clone" || action == "recloned" || action == "no_updates" { + + if outcome.action == "clone" + || outcome.action == "recloned" + || outcome.action == "no_updates" + { clone_result.size = Some(git::format_size(final_size)); } else { clone_result.local_size = Some(git::format_size(local_size)); clone_result.final_size = Some(git::format_size(final_size)); } - + clone_result } Err(e) => CloneResult { - locale: locale.to_string(), + locale: short, result: "failed".to_string(), - position: String::new(), // Will be set by caller + position: String::new(), size: 
None, local_size: None, final_size: None, error: Some(e.to_string()), + pin: None, }, } } @@ -302,15 +581,17 @@ fn print_result(result: &CloneResult) { } else { let size_str = if let Some(ref size) = result.size { size.clone() - } else if let (Some(ref local), Some(ref final_size)) = (&result.local_size, &result.final_size) { + } else if let (Some(ref local), Some(ref final_size)) = + (&result.local_size, &result.final_size) + { format!("{} -> {}", local, final_size) } else { String::new() }; - + // result.result now contains the emoji directly (🆕, ⬇️, ✅, 🔄) let action_emoji = &result.result; - + if !size_str.is_empty() { eprintln!("{} {:<6} [{}]", action_emoji, result.locale, size_str); } else { @@ -323,19 +604,19 @@ fn print_result(result: &CloneResult) { /// Perform clone/pull operations and print results as they complete async fn perform_clone_operations( - repos_to_clone: Vec, + datasets: Vec, repos_dir: PathBuf, token_str: Option<&str>, num_jobs: usize, verbose: bool, ) -> anyhow::Result> { - let total = repos_to_clone.len(); + let total = datasets.len(); let mut all_results = Vec::new(); - + if total == 1 || num_jobs == 1 { // Sequential clone/pull - print as we go - for (idx, locale) in repos_to_clone.iter().enumerate() { - let mut result = process_single_locale(locale, &repos_dir, token_str, verbose); + for (idx, dataset) in datasets.iter().enumerate() { + let mut result = process_single_dataset(dataset, &repos_dir, token_str, verbose); result.position = format!("{}/{}", idx + 1, total); print_result(&result); all_results.push(result); @@ -344,18 +625,22 @@ async fn perform_clone_operations( // Parallel clone/pull - print as results come in use std::sync::{Arc, Mutex}; let completed = Arc::new(Mutex::new(0usize)); - - let clone_futures = stream::iter(repos_to_clone.iter()) - .map(|locale| { - let locale = locale.clone(); + + let clone_futures = stream::iter(datasets.into_iter()) + .map(|dataset| { let repos_dir = repos_dir.clone(); let token = 
token_str.map(|s| s.to_string()); let completed = completed.clone(); let total = total; let verbose_flag = verbose; - + tokio::task::spawn_blocking(move || { - let mut result = process_single_locale(&locale, &repos_dir, token.as_deref(), verbose_flag); + let mut result = process_single_dataset( + &dataset, + &repos_dir, + token.as_deref(), + verbose_flag, + ); let mut count = completed.lock().unwrap(); *count += 1; result.position = format!("{}/{}", *count, total); @@ -365,7 +650,7 @@ async fn perform_clone_operations( .buffer_unordered(num_jobs); let mut stream = clone_futures; - + while let Some(result) = stream.next().await { match result { Ok(data) => { @@ -381,6 +666,7 @@ async fn perform_clone_operations( local_size: None, final_size: None, error: Some(format!("Task error: {}", e)), + pin: None, }; print_result(&error_result); all_results.push(error_result); @@ -391,134 +677,150 @@ async fn perform_clone_operations( let _ = std::io::stderr().flush(); } } - + Ok(all_results) } +/// Write/update `govbot.lock` from a batch of successful clone/pull results. +/// Non-fatal: a lockfile-write failure prints a warning but does not abort. 
+fn update_lockfile(project_dir: &std::path::Path, results: &[CloneResult]) { + let mut lock = match LockFile::load_or_default(project_dir) { + Ok(l) => l, + Err(e) => { + eprintln!( + "⚠️ Could not read govbot.lock ({}); skipping pin update", + e + ); + return; + } + }; + let mut pinned = 0usize; + for r in results { + if let Some(pin) = &r.pin { + lock.pin( + &pin.canonical_id, + &pin.git_url, + pin.channel.as_deref(), + &pin.commit, + &pin.cache_key, + ); + pinned += 1; + } + } + if pinned == 0 { + return; + } + match lock.save(project_dir) { + Ok(()) => eprintln!("🔒 Updated govbot.lock ({} datasets pinned)", pinned), + Err(e) => eprintln!("⚠️ Could not write govbot.lock: {}", e), + } +} -async fn run_clone_command(cmd: Command) -> anyhow::Result<()> { - let Command::Clone { +async fn run_pull_command(cmd: Command) -> anyhow::Result<()> { + let Command::Pull { repos, govbot_dir, token, parallel, verbose, list, - } = cmd else { + } = cmd + else { unreachable!() }; + let registry = load_registry()?; + // If --list flag is set, show the list if list { - println!("Available repos:"); - let all_locales = govbot::locale::WorkingLocale::all(); - for locale in all_locales { - println!(" {}", locale.as_lowercase()); + println!("Available datasets:"); + for d in registry.all() { + println!(" {}", d.short_name()); } - println!(" all (clone all repos)"); + println!(" all (pull every dataset)"); return Ok(()); } let repos_dir = get_govbot_dir(govbot_dir)?; - + let proj_dir = project_dir()?; + // Get token from argument or environment variable let env_token = std::env::var("TOKEN").ok(); let token_str = token.as_deref().or(env_token.as_deref()); - + // Get parallelization setting let num_jobs = parallel - .or_else(|| std::env::var("GOVBOT_JOBS").ok().and_then(|s| s.parse().ok())) + .or_else(|| { + std::env::var("GOVBOT_JOBS") + .ok() + .and_then(|s| s.parse().ok()) + }) .unwrap_or(4); - // Parse repos and handle "all" - let mut repos_to_clone = Vec::new(); - - if 
repos.is_empty() { - // No repos specified: find existing repos to update - // Check all known locales to see which repos exist - let all_locales = govbot::locale::WorkingLocale::all(); - for locale in all_locales { - let locale_str = locale.as_lowercase(); - let repo_name = git::build_repo_name(&locale_str); - let repo_path = repos_dir.join(&repo_name); - - // Check if this is a git repository - if repo_path.exists() && repo_path.join(".git").exists() { - repos_to_clone.push(locale_str.to_string()); - } - } - - if repos_to_clone.is_empty() { - eprintln!("No repos downloaded yet in this directory"); - eprintln!("to download all gov data, do `govbot clone all`. future syncs are just `govbot clone`"); + // Resolve which datasets to pull. + let datasets_to_pull: Vec = if repos.is_empty() { + // No datasets specified: update whatever is already cloned locally. + // A locally-present dataset that is no longer in the registry is + // skipped with a warning rather than aborting the whole update. + let local = git::get_local_datasets(&repos_dir).unwrap_or_default(); + if local.is_empty() { + eprintln!("No datasets downloaded yet in this directory"); + eprintln!("to download all gov data, do `govbot pull all`. 
future syncs are just `govbot pull`"); return Ok(()); } - - // Create directory if it doesn't exist (needed for the clone operations) - std::fs::create_dir_all(&repos_dir)?; - } else { - // Create directory if it doesn't exist (needed for the clone operations) std::fs::create_dir_all(&repos_dir)?; - - // Parse specified repos - for repo in repos { - let repo = repo.trim().to_lowercase(); - if repo.is_empty() { - continue; - } - - if repo == "all" { - // Add all working locales - let all_locales = govbot::locale::WorkingLocale::all(); - for loc in all_locales { - repos_to_clone.push(loc.as_lowercase().to_string()); - } - } else { - // Validate locale - let _ = govbot::locale::WorkingLocale::from(repo.as_str()); - repos_to_clone.push(repo); + let mut resolved = Vec::new(); + for short in &local { + match registry.resolve(short) { + Ok(d) => resolved.push(d), + Err(_) => eprintln!("⚠️ Skipping '{}' — not in the registry", short), } } - } + resolved + } else { + std::fs::create_dir_all(&repos_dir)?; + registry + .resolve_all(&repos) + .map_err(|e| anyhow::anyhow!("{}", e))? + }; - if repos_to_clone.is_empty() { + if datasets_to_pull.is_empty() { return Ok(()); -} + } // Print initial message with count - eprintln!("🔁 Syncing {} repos\n", repos_to_clone.len()); + eprintln!("🔁 Syncing {} datasets\n", datasets_to_pull.len()); // Perform clone operations and print results as they complete - let results = perform_clone_operations( - repos_to_clone, - repos_dir, - token_str, - num_jobs, - verbose, - ).await?; - + let results = + perform_clone_operations(datasets_to_pull, repos_dir, token_str, num_jobs, verbose).await?; + + // Pin resolved SHAs into govbot.lock for reproducibility. 
+ update_lockfile(&proj_dir, &results); + // Show summary - let errors: Vec<_> = results.iter() - .filter(|r| r.result == "failed") - .collect(); - + let errors: Vec<_> = results.iter().filter(|r| r.result == "failed").collect(); + if !errors.is_empty() { eprintln!("\n❌ Errors occurred: {}/{}", errors.len(), results.len()); } else if !results.is_empty() { - eprintln!("\n✅ Successfully processed all {} repos!", results.len()); + eprintln!( + "\n✅ Successfully processed all {} datasets!", + results.len() + ); } - + Ok(()) } - async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { let Command::Delete { locales, govbot_dir, parallel, verbose, - } = cmd else { + } = cmd + else { unreachable!() }; @@ -537,30 +839,35 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { } let repos_dir = get_govbot_dir(govbot_dir)?; - + // Get parallelization setting let num_jobs = parallel - .or_else(|| std::env::var("GOVBOT_JOBS").ok().and_then(|s| s.parse().ok())) + .or_else(|| { + std::env::var("GOVBOT_JOBS") + .ok() + .and_then(|s| s.parse().ok()) + }) .unwrap_or(4); - // Parse locales and handle "all" + // Parse datasets and handle "all". `all` expands to whatever is cloned + // locally — there is nothing to delete that is not on disk. let mut locales_to_delete = Vec::new(); for locale in locales { let locale = locale.trim().to_lowercase(); if locale.is_empty() { continue; } - + if locale == "all" { - // Add all working locales - let all_locales = govbot::locale::WorkingLocale::all(); - for loc in all_locales { - locales_to_delete.push(loc.as_lowercase().to_string()); + for short in git::get_local_datasets(&repos_dir).unwrap_or_default() { + locales_to_delete.push(short); } } else { - // Validate locale - let _ = govbot::locale::WorkingLocale::from(locale.as_str()); - locales_to_delete.push(locale); + // A dataset identifier may be namespaced; delete keys on the short + // (slash-free) name the clone directory uses. 
+ let short = locale.rsplit('/').next().unwrap_or(&locale).to_string(); + let short = short.split('@').next().unwrap_or(&short).to_string(); + locales_to_delete.push(short); } } @@ -575,19 +882,19 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { let total = locales_to_delete.len(); let mut deleted_count = 0; let mut failed_count = 0; - + if total == 1 || num_jobs == 1 { // Sequential delete for (idx, locale) in locales_to_delete.iter().enumerate() { - let repo_name = format!("{}-data-pipeline", locale); + let repo_name = git::repo_dir_name(locale); let target_dir = repos_dir.join(&repo_name); - let existed = target_dir.exists(); - + let existed = target_dir.exists() || std::fs::symlink_metadata(&target_dir).is_ok(); + if verbose { eprintln!("[{}/{}] Deleting {}...", idx + 1, total, locale); } - - match git::delete_repo(locale, &repos_dir) { + + match git::delete_dataset(locale, &repos_dir) { Ok(_) => { if existed { eprintln!("{:<4} deleted", locale); @@ -607,7 +914,7 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { use std::sync::{Arc, Mutex}; let deleted = Arc::new(Mutex::new(0usize)); let failed = Arc::new(Mutex::new(0usize)); - + let delete_futures = stream::iter(locales_to_delete.iter()) .map(|locale| { let locale = locale.clone(); @@ -616,20 +923,21 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { let failed = failed.clone(); let total = total; let verbose_flag = verbose; - + tokio::task::spawn_blocking(move || { - let repo_name = format!("{}-data-pipeline", locale); + let repo_name = git::repo_dir_name(&locale); let target_dir = repos_dir.join(&repo_name); - + if verbose_flag { let d = deleted.lock().unwrap(); let f = failed.lock().unwrap(); let current = *d + *f + 1; eprintln!("[{}/{}] Deleting {}...", current, total, locale); } - - let existed = target_dir.exists(); - match git::delete_repo(&locale, &repos_dir) { + + let existed = + target_dir.exists() || 
std::fs::symlink_metadata(&target_dir).is_ok(); + match git::delete_dataset(&locale, &repos_dir) { Ok(_) => { if existed { let mut d = deleted.lock().unwrap(); @@ -650,7 +958,7 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { .buffer_unordered(num_jobs); let mut stream = delete_futures; - + while let Some(result) = stream.next().await { match result { Ok((locale, Ok(status))) => { @@ -666,11 +974,11 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { } } } - + deleted_count = *deleted.lock().unwrap(); failed_count = *failed.lock().unwrap(); } - + // Show summary if failed_count > 0 { eprintln!("\n❌ Errors occurred: {}/{}", failed_count, total); @@ -679,12 +987,178 @@ async fn run_delete_command(cmd: Command) -> anyhow::Result<()> { } else { eprintln!("\n✅ No repositories found to delete."); } - + Ok(()) } -async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { - let Command::Logs { +/// Collapse a fully-joined `govbot source` entry into the +/// `{"id","text","kind":"docs"}` document the govbot stream protocol defines +/// (`STREAM_PROTOCOL.md` §1) — the record `fastclass classify -` consumes. +/// +/// `id` is the bill's dataset-relative directory path of the form +/// `/country:/state:/sessions//bills/` so the +/// classified result can be routed back to the right *bill* (not session) +/// when `govbot apply` writes it. Two real-world dataset layouts feed into +/// this: +/// +/// 1. **Per-bill log directory** — `sources.log` is already +/// `/.../sessions//bills//logs/.json`. +/// Stripping the `/logs/...` tail yields the bill path directly. +/// 2. **Session-level log directory** (the common case for OCD-files +/// datasets cloned from windycivi) — the on-disk log lives at +/// `/.../sessions//logs/.json` and is a *symlink* +/// to `.../sessions//bills//logs/.json`. 
The walker +/// reports the symlink path, so stripping `/logs/...` would stop at +/// the *session* and collide every bill in that session onto one id +/// (real bug surfaced by `govbot pull all` over the 55-state corpus: +/// 4916 records collapsed to 97 ids). The fix appends the bill_id +/// whenever the stripped path doesn't already end in `/bills/`. +/// +/// **Bill-id source of truth.** The on-disk bill directory name (e.g. +/// `HB5109`) does **not** always equal the `log.bill_id` field (e.g. +/// `"HB 5109"`). MI/WV/ND/PA logs carry a *display* bill id with a space +/// between the chamber prefix and the number; the actual `bills//` +/// directory has no space. Using `log.bill_id` verbatim produces an `id` +/// like `.../bills/HB 5109` that no `os.path.join(REPOS, doc, +/// "metadata.json")` can resolve. The fix is to take the canonical bill +/// dir name from `sources.bill` (the parent dir of `metadata.json` — the +/// *resolved* on-disk path, set during the `bill` join) whenever +/// available, and fall back to `log.bill_id` only when the bill join is +/// absent. Layout 1 (suffix already present in `sources.log`) is left +/// untouched — that path is itself the canonical on-disk path, so the +/// bill segment is correct by construction. +/// +/// `text` is the **full** bill text assembled from `metadata.json` (not just +/// titles) — the `docs` projection joins the complete bill so this is whole. +/// +/// `subjects` is the **optional** OCD `subject:` array, surfaced as a +/// peer of `text` so a downstream `concept_match` matcher can score against +/// the human-curated controlled vocabulary directly. The field is **omitted +/// entirely** when the bill has no `subject:` (vs. an empty `[]`, which +/// would conflate "no signal" with "explicitly empty") — see +/// `selectors::ocd_files_extract_subjects` and STREAM_PROTOCOL.md §1. 
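The id rule described in the doc comment above can be illustrated with a small stand-alone sketch. This is not the function from the diff: it takes plain strings instead of the joined JSON entry, the name `doc_id` is hypothetical, and the example paths (dataset `mi`, session `2025`, log file names) are invented for illustration. It only shows how the two layouts and the display-form fallback are meant to resolve.

```rust
/// Simplified stand-in for the id rule. Inputs are plain strings rather than
/// the joined entry; `doc_id` is a hypothetical name used only in this sketch.
fn doc_id(log_path: &str, canonical_bill_dir: Option<&str>, log_bill_id: Option<&str>) -> String {
    // Strip the `/logs/...` tail from the log path.
    let stripped = log_path.split("/logs/").next().unwrap_or(log_path);

    // Layout 1 test: does the stripped path already end in `/bills/<dir>`?
    let ends_in = |dir: Option<&str>| {
        dir.map(|d| stripped.ends_with(format!("/bills/{}", d).as_str()))
            .unwrap_or(false)
    };

    if ends_in(canonical_bill_dir) || ends_in(log_bill_id) {
        // Layout 1: the stripped path is already the bill directory.
        stripped.to_string()
    } else if let Some(dir) = canonical_bill_dir.or(log_bill_id) {
        // Layout 2: prefer the on-disk dir name, fall back to the log's bill id.
        format!("{}/bills/{}", stripped, dir)
    } else {
        stripped.to_string()
    }
}

fn main() {
    // Hypothetical paths, for illustration only.
    let per_bill = "mi/country:us/state:mi/sessions/2025/bills/HB5109/logs/0001_intro.json";
    let session_level = "mi/country:us/state:mi/sessions/2025/logs/0001_intro.json";

    println!("{}", doc_id(per_bill, Some("HB5109"), Some("HB 5109")));
    println!("{}", doc_id(session_level, Some("HB5109"), Some("HB 5109")));
    println!("{}", doc_id(session_level, None, Some("HB 5109")));
}
```

The third call shows why the fallback is only advisory: without the bill join, the display form with a space ends up in the id.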
+fn ocd_entry_to_doc(entry: &serde_json::Value) -> serde_json::Value { + let bill_id = entry + .get("log") + .and_then(|l| l.get("bill_id").or_else(|| l.get("bill_identifier"))) + .and_then(|v| v.as_str()) + .map(|s| s.to_string()); + + // Canonical on-disk bill directory name, derived from `sources.bill` + // (the path to `metadata.json`, which the bill join resolves to the + // real `bills//metadata.json` on disk — even when the log was a + // session-level symlink). This is the authoritative source for the + // `/bills/` segment because `log.bill_id` may carry a display + // form (e.g. `"HB 5109"`) that differs from the directory (`HB5109`). + let canonical_bill_dir = entry + .get("sources") + .and_then(|s| s.get("bill")) + .and_then(|v| v.as_str()) + .and_then(bill_dir_from_metadata_path) + .map(|s| s.to_string()); + + let stripped = entry + .get("sources") + .and_then(|s| s.get("log")) + .and_then(|v| v.as_str()) + .and_then(|log_path| log_path.split("/logs/").next()) + .map(|s| s.to_string()); + + // Layout 1 still trusts the stripped log path: when `sources.log` + // already ends in `/bills/` that dir name is itself canonical + // (it came from the on-disk walk). Layout 2 must prefer the + // `sources.bill`-derived dir name; only fall back to `log.bill_id` + // when the bill join wasn't requested. + // + // The Layout-1 test must consider BOTH the canonical bill dir (from + // `sources.bill`) AND `log.bill_id`. If we only checked + // `log.bill_id`, then MI/WV/ND/PA — whose log carries `"HB 0163"` + // but on-disk dir is `HB0163` — would fail the Layout-1 test even + // when `sources.log` already ends in `/bills/HB0163`, and we'd + // double-append, producing `.../bills/HB0163/bills/HB0163`. 
+ let id = match stripped { + Some(path) => { + let already_ends_in_bill_dir = canonical_bill_dir + .as_deref() + .map(|d| path.ends_with(&format!("/bills/{}", d))) + .unwrap_or(false) + || bill_id + .as_deref() + .map(|d| path.ends_with(&format!("/bills/{}", d))) + .unwrap_or(false); + if already_ends_in_bill_dir { + // Layout 1: log lived under bills//logs/. The stripped + // path is already the canonical bill dir. + path + } else if let Some(canon) = canonical_bill_dir.as_deref() { + // Layout 2 (preferred): use the on-disk dir name from the + // resolved metadata.json path, so display-form bill ids + // with whitespace (e.g. `"HB 5109"`) don't bleed into the + // doc id and break sibling-file lookups. + format!("{}/bills/{}", path, canon) + } else if let Some(bid) = bill_id.as_deref() { + // Layout 2 fallback: no bill join, so the best we have is + // the log's `bill_id`. This may be a display form; callers + // doing path lookups should treat it as advisory. + format!("{}/bills/{}", path, bid) + } else { + path + } + } + None => canonical_bill_dir.or(bill_id).unwrap_or_else(String::new), + }; + let mut out = serde_json::Map::new(); + out.insert("id".to_string(), serde_json::Value::String(id)); + out.insert( + "text".to_string(), + serde_json::Value::String(ocd_files_select_default(entry)), + ); + out.insert( + "kind".to_string(), + serde_json::Value::String("docs".to_string()), + ); + // Optional `subjects:` — only emitted when the bill actually carries one + // or more non-empty OCD `subject:` entries. `None` is the unambiguous + // "no signal" form; we never emit `"subjects": []`. 
+ if let Some(subjects) = ocd_files_extract_subjects(entry) { + out.insert( + "subjects".to_string(), + serde_json::Value::Array( + subjects + .into_iter() + .map(serde_json::Value::String) + .collect(), + ), + ); + } + serde_json::Value::Object(out) +} + +/// Given a `sources.bill` path (`<...>/bills//metadata.json`, +/// possibly with `..` prefixes from a cache-symlinked repo), return the +/// `` segment — the canonical on-disk bill directory name. Returns +/// `None` if the path doesn't end in `bills//metadata.json`. +fn bill_dir_from_metadata_path(metadata_path: &str) -> Option<&str> { + // Strip the trailing filename. + let without_file = metadata_path.strip_suffix("/metadata.json")?; + // Take the last path segment — that's the bill dir. + let last_slash = without_file.rfind('/')?; + let dir = &without_file[last_slash + 1..]; + // Sanity check: the segment before that should be `bills`. If not, + // the path doesn't look like a bill metadata path; refuse to guess. + let before_dir = &without_file[..last_slash]; + if !before_dir.ends_with("/bills") && before_dir != "bills" { + return None; + } + if dir.is_empty() { + None + } else { + Some(dir) + } +} + +async fn run_source_command(cmd: Command) -> anyhow::Result<()> { + let Command::Source { govbot_dir, repos, sort: _sort, @@ -692,10 +1166,11 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { join, select, filter, - } = cmd else { + } = cmd + else { unreachable!() }; - + // Parse join options - now supports field paths like "bill.title" and special "tags" let mut join_specs: Vec<(String, Vec)> = Vec::new(); let mut join_tags = false; @@ -715,7 +1190,11 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { let limit_parsed: Option = if limit.to_lowercase() == "none" { None } else { - Some(limit.parse().map_err(|e| anyhow::anyhow!("Invalid limit value '{}': {}", limit, e))?) 
+ Some( + limit + .parse() + .map_err(|e| anyhow::anyhow!("Invalid limit value '{}': {}", limit, e))?, + ) }; // Parse comma-separated repos if provided as single string @@ -734,32 +1213,29 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { repo_list.push("all".to_string()); } - // Expand "all" to existing repos in the directory, or convert locale names to repo names + // Expand "all" to the datasets cloned in the directory, or map dataset + // identifiers to their on-disk repo directory names. let mut repos_to_process = Vec::new(); for locale in repo_list { let locale = locale.trim().to_lowercase(); if locale.is_empty() { continue; } - + if locale == "all" { - // Find all existing repos in the directory + // Every dataset cloned locally — registry membership is not + // required here, only on-disk presence. if git_dir.exists() { - let all_locales = govbot::locale::WorkingLocale::all(); - for loc in all_locales { - let locale_str = loc.as_lowercase(); - let repo_name = git::build_repo_name(&locale_str); - let repo_path = git_dir.join(&repo_name); - - // Only add repos that actually exist (for logs, we don't need .git, just the directory) - if repo_path.exists() && repo_path.is_dir() { - repos_to_process.push(repo_name); - } + for short in git::get_local_datasets(&git_dir).unwrap_or_default() { + repos_to_process.push(git::repo_dir_name(&short)); } } } else { - // Convert locale name to repo name using build_repo_name - repos_to_process.push(git::build_repo_name(&locale)); + // A dataset identifier may be namespaced; the clone directory is + // keyed on the short (slash-free) name. 
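The short-name rule mentioned in the comment above (namespaced identifier in, slash-free clone-directory key out) can be sketched in isolation. The helper name `short_dataset_name` is hypothetical, and the `@stable` suffix in the example is an assumed channel-style qualifier; only the `rsplit('/')` and `split('@')` steps come from the diff itself.

```rust
/// Reduce a dataset identifier to the short, slash-free name.
/// Hypothetical helper: the diff inlines these two steps instead.
fn short_dataset_name(identifier: &str) -> &str {
    // Keep only the segment after the last `/` (drops any namespace).
    let last_segment = identifier.rsplit('/').next().unwrap_or(identifier);
    // Drop everything from the first `@` onwards (assumed qualifier suffix).
    last_segment.split('@').next().unwrap_or(last_segment)
}

fn main() {
    // `us-legislation/ca` appears in the CLI help text; `@stable` is assumed.
    for id in ["wy", "us-legislation/ca", "us-legislation/ca@stable"] {
        println!("{} -> {}", id, short_dataset_name(id));
    }
}
```

Both `rsplit` and `split` always yield at least one item, so the `unwrap_or` fallbacks are defensive rather than reachable.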
+ let short = locale.rsplit('/').next().unwrap_or(&locale); + let short = short.split('@').next().unwrap_or(short); + repos_to_process.push(git::repo_dir_name(short)); } } @@ -771,8 +1247,11 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { // Process each repo (with optional filtering) for repo_name in repos_to_process { + // A project's repo entry may be a symlink into the shared dataset + // cache. The walker reads through it transparently and reports child + // paths under `git_dir`, so `sources.log` stays project-relative. let repo_path = git_dir.join(&repo_name); - + if !repo_path.exists() { eprintln!("Warning: Repository not found: {}", repo_path.display()); continue; @@ -781,7 +1260,7 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { // Walk the repo directory to find log files matching the pattern: // repo_name/country:{country}/state:{state}/sessions/{session_name}/logs/*.json let mut file_count = 0; - + for entry_result in WalkDir::new(&repo_path) .process_read_dir(|_depth, _path, _read_dir_state, _children| { // Optional: customize directory reading behavior @@ -801,7 +1280,7 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { } let path = entry.path(); - + // Check if it's a JSON file in a logs directory if !path.is_file() { continue; @@ -814,7 +1293,7 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { // Check if path matches: country:{country}/state:{state}/sessions/{session_name}/logs/*.json let path_str = path.to_string_lossy(); let repo_prefix = repo_path.to_string_lossy(); - + // Get relative path by stripping the repo prefix // Handle both absolute and relative paths let relative_path = if let Some(stripped) = path_str.strip_prefix(&*repo_prefix) { @@ -824,11 +1303,11 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { // If prefix doesn't match, skip this file continue; }; - + // Match pattern: country:*/state:*/sessions/*/logs/*.json // Use a simple regex-like 
check: must have these components in order - if relative_path.starts_with("country:") - && relative_path.contains("/state:") + if relative_path.starts_with("country:") + && relative_path.contains("/state:") && relative_path.contains("/sessions/") && relative_path.contains("/logs/") && relative_path.ends_with(".json") @@ -838,12 +1317,12 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { let state_pos = relative_path.find("/state:").unwrap_or(usize::MAX); let sessions_pos = relative_path.find("/sessions/").unwrap_or(usize::MAX); let logs_pos = relative_path.find("/logs/").unwrap_or(usize::MAX); - + // Verify order: country < state < sessions < logs if country_pos < state_pos && state_pos < sessions_pos && sessions_pos < logs_pos { // Compute relative source path let source_path_str = compute_relative_source_path(&path, &git_dir); - + // Read JSON file, parse it, and build extensible output structure match fs::read_to_string(&path) { Ok(contents) => { @@ -857,19 +1336,22 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { .or_else(|| json_value.get("bill_identifier")) .and_then(|id| id.as_str()) .map(|s| s.to_string()); - + // Build output with extensible structure: // - Data keys (log, bill, etc.) 
are singular entity names matching source keys // - sources object automatically tracks all data sources let mut output = serde_json::Map::new(); - + // Add the log data with key "log" (matching sources.log) output.insert("log".to_string(), json_value); - + // Add sources with the log path let mut sources = serde_json::Map::new(); - sources.insert("log".to_string(), serde_json::Value::String(source_path_str.clone())); - + sources.insert( + "log".to_string(), + serde_json::Value::String(source_path_str.clone()), + ); + // Join additional datasets if requested for (dataset_name, field_path) in &join_specs { match dataset_name.as_str() { @@ -881,36 +1363,59 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { Ok(p) => p, Err(_) => path.clone(), }; - - let metadata_path = canonical_log_path.parent() + + let metadata_path = canonical_log_path + .parent() .and_then(|logs_dir| { logs_dir.parent().map(|bill_dir| { bill_dir.join("metadata.json") }) }); - + if let Some(ref metadata_path) = metadata_path { if metadata_path.exists() { match fs::read_to_string(metadata_path) { Ok(metadata_contents) => { - match serde_json::from_str::(&metadata_contents) { + match serde_json::from_str::< + serde_json::Value, + >( + &metadata_contents + ) { Ok(metadata_value) => { // If field_path is specified, extract just that field // Otherwise, include the full bill data if field_path.is_empty() { // No field path specified, include full bill data - output.insert("bill".to_string(), metadata_value); + output.insert( + "bill".to_string(), + metadata_value, + ); } else { // Extract specific field(s) from bill data - if let Some(field_value) = extract_json_field(&metadata_value, field_path) { + if let Some( + field_value, + ) = + extract_json_field( + &metadata_value, + field_path, + ) + { // Use the full join path as the key (e.g., "bill.title") - let output_key = format!("{}.{}", dataset_name, field_path.join(".")); - output.insert(output_key, field_value); + let output_key = 
format!( + "{}.{}", + dataset_name, + field_path + .join(".") + ); + output.insert( + output_key, + field_value, + ); } else { eprintln!("Warning: Field path {:?} not found in metadata from {}", field_path, metadata_path.display()); } } - + // Add bill source path let bill_source_path = compute_relative_source_path(metadata_path, &git_dir); sources.insert("bill".to_string(), serde_json::Value::String(bill_source_path)); @@ -932,97 +1437,144 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { } } _ => { - eprintln!("Warning: Unknown join dataset: {}", dataset_name); + eprintln!( + "Warning: Unknown join dataset: {}", + dataset_name + ); } } } - - // Join tags if requested + + // Join tags if requested. + // + // `.govbot/` is the tool's cache — tag + // files no longer live inside it. The + // primary lookup is the project-rooted + // `/tags//...` layout + // `govbot apply` writes today. Two + // read-only fallbacks stay live for + // migration: the in-cache `/ + // tags/` location Bug 6 added, and the + // cwd-rooted `country:.../sessions// + // tags/` layout that pre-dates Bug 1. + // First non-empty match wins; an empty + // result on every candidate is silent. 
if join_tags { - // Extract country, state, session_id from the path - if let Some((country, state, session_id)) = extract_path_info(&source_path_str) { - // Use bill_id extracted earlier - if let Some(ref bill_id) = bill_id_opt { - // Look for tags in cwd/country:us/state:{state}/sessions/{session_id}/tags/ - let cwd = std::env::current_dir().unwrap_or_else(|_| PathBuf::from(".")); - let tags_dir = cwd - .join(&format!("country:{}", country)) - .join(&format!("state:{}", state)) - .join("sessions") - .join(&session_id) - .join("tags"); - - if tags_dir.exists() && tags_dir.is_dir() { - let mut matched_tags = serde_json::Map::new(); - if let Ok(entries) = fs::read_dir(&tags_dir) { - for entry in entries.flatten() { - let path = entry.path(); - // Check for both .tag.json and .json files - if let Some(ext) = path.extension().and_then(|s| s.to_str()) { - if ext == "json" { - if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) { - // Remove .tag suffix if present (e.g., "budget.tag" -> "budget") - let tag_name = stem.strip_suffix(".tag").unwrap_or(stem); - match fs::read_to_string(&path) { - Ok(contents) => { - if let Ok(tag_file) = serde_json::from_str::(&contents) { - // Check if bill_id exists in bills map - if let Some(bill_result) = tag_file.bills.get(bill_id) { - // Return the score breakdown - matched_tags.insert(tag_name.to_string(), serde_json::to_value(&bill_result.score).unwrap_or(serde_json::Value::Null)); - } - } - } - Err(_) => {} - } - } - } - } - } - } - if !matched_tags.is_empty() { - output.insert("tags".to_string(), serde_json::Value::Object(matched_tags)); - } + if let Some(ref bill_id) = bill_id_opt { + let mut matched_tags: serde_json::Map< + String, + serde_json::Value, + > = serde_json::Map::new(); + + let cwd = std::env::current_dir() + .unwrap_or_else(|_| PathBuf::from(".")); + for candidate in + resolve_tags_dir_candidates(&path, &cwd) + { + matched_tags = + match_tags_in_dir(&candidate, bill_id); + if !matched_tags.is_empty() { + 
break; + } + } + + // Final fallback: pre-Bug-1 + // cwd-rooted layout. Only + // consulted when the dataset- + // aware candidates all came up + // empty. + if matched_tags.is_empty() { + if let Some((country, state, session_id)) = + extract_path_info(&source_path_str) + { + let legacy_tags_dir = cwd + .join(format!("country:{}", country)) + .join(format!("state:{}", state)) + .join("sessions") + .join(&session_id) + .join("tags"); + matched_tags = match_tags_in_dir( + &legacy_tags_dir, + bill_id, + ); } } + + if !matched_tags.is_empty() { + output.insert( + "tags".to_string(), + serde_json::Value::Object(matched_tags), + ); + } } } - - output.insert("sources".to_string(), serde_json::Value::Object(sources)); - + + output.insert( + "sources".to_string(), + serde_json::Value::Object(sources), + ); + // Extract timestamp from sources.log path (after "logs/" and before "_") // Do this after sources is inserted so we can use the final sources.log value let timestamp = extract_timestamp_from_path(&source_path_str); if let Some(ref ts) = timestamp { - output.insert("timestamp".to_string(), serde_json::Value::String(ts.clone())); + output.insert( + "timestamp".to_string(), + serde_json::Value::String(ts.clone()), + ); } - + let mut output_value = serde_json::Value::Object(output); - - // Apply select transformation if requested + + // Apply select transformation if requested. + // `default` trims each entry to the familiar + // title/abstracts/subject shape. `docs` deliberately + // does NOT trim — it keeps the full joined `bill` + // (the whole metadata.json) so the {id,text,kind} + // document carries the FULL bill text per + // STREAM_PROTOCOL §1. The collapse to {id,text,kind} + // happens after the entry survives the filter. 
if select == "default" { // Select specific keys from nested objects, preserving structure let mut selected_output = serde_json::Map::new(); - + // Top: id (from log.bill_id), then log object with selected fields - if let Some(id) = output_value.get("log").and_then(|l| l.get("bill_id").or_else(|| l.get("bill_identifier"))).and_then(|v| v.as_str()) { - selected_output.insert("id".to_string(), serde_json::Value::String(id.to_string())); + if let Some(id) = output_value + .get("log") + .and_then(|l| { + l.get("bill_id") + .or_else(|| l.get("bill_identifier")) + }) + .and_then(|v| v.as_str()) + { + selected_output.insert( + "id".to_string(), + serde_json::Value::String(id.to_string()), + ); } - + // Create log object with only action and bill_id if let Some(log) = output_value.get("log") { let mut log_obj = serde_json::Map::new(); if let Some(action) = log.get("action") { - log_obj.insert("action".to_string(), action.clone()); + log_obj + .insert("action".to_string(), action.clone()); } - if let Some(bill_id) = log.get("bill_id").or_else(|| log.get("bill_identifier")) { - log_obj.insert("bill_id".to_string(), bill_id.clone()); + if let Some(bill_id) = log + .get("bill_id") + .or_else(|| log.get("bill_identifier")) + { + log_obj + .insert("bill_id".to_string(), bill_id.clone()); } if !log_obj.is_empty() { - selected_output.insert("log".to_string(), serde_json::Value::Object(log_obj)); + selected_output.insert( + "log".to_string(), + serde_json::Value::Object(log_obj), + ); } } - + // Create bill object with only selected fields if let Some(bill) = output_value.get("bill") { let mut bill_obj = serde_json::Map::new(); @@ -1030,54 +1582,85 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { bill_obj.insert("title".to_string(), title.clone()); } if let Some(abstracts) = bill.get("abstracts") { - bill_obj.insert("abstracts".to_string(), abstracts.clone()); + bill_obj.insert( + "abstracts".to_string(), + abstracts.clone(), + ); } if let Some(subject) = 
bill.get("subject") { - bill_obj.insert("subject".to_string(), subject.clone()); + bill_obj + .insert("subject".to_string(), subject.clone()); } if let Some(identifier) = bill.get("identifier") { - bill_obj.insert("identifier".to_string(), identifier.clone()); + bill_obj.insert( + "identifier".to_string(), + identifier.clone(), + ); } if let Some(session) = bill.get("legislative_session") { - bill_obj.insert("legislative_session".to_string(), session.clone()); + bill_obj.insert( + "legislative_session".to_string(), + session.clone(), + ); } if let Some(org) = bill.get("from_organization") { - bill_obj.insert("from_organization".to_string(), org.clone()); + bill_obj.insert( + "from_organization".to_string(), + org.clone(), + ); } if !bill_obj.is_empty() { - selected_output.insert("bill".to_string(), serde_json::Value::Object(bill_obj)); + selected_output.insert( + "bill".to_string(), + serde_json::Value::Object(bill_obj), + ); } } - + // Always include tags (even if empty/null) since it's part of the default selector if let Some(tags) = output_value.get("tags") { - selected_output.insert("tags".to_string(), tags.clone()); + selected_output + .insert("tags".to_string(), tags.clone()); } else { // Include empty tags object if not present - selected_output.insert("tags".to_string(), serde_json::Value::Null); + selected_output.insert( + "tags".to_string(), + serde_json::Value::Null, + ); } - + // Bottom: sources, timestamp if let Some(sources) = output_value.get("sources") { - selected_output.insert("sources".to_string(), sources.clone()); + selected_output + .insert("sources".to_string(), sources.clone()); } if let Some(timestamp) = output_value.get("timestamp") { - selected_output.insert("timestamp".to_string(), timestamp.clone()); + selected_output + .insert("timestamp".to_string(), timestamp.clone()); } - + output_value = serde_json::Value::Object(selected_output); } - + // Apply filter - let should_output = match filter_manager.should_keep(&output_value, 
&repo_name) { + let should_output = match filter_manager + .should_keep(&output_value, &repo_name) + { govbot::FilterResult::Keep => true, govbot::FilterResult::FilterOut => false, }; - + if should_output { + // `docs` mode: collapse the surviving entry to the + // {id,text} document shape fastclass consumes. + let output_value = if select == "docs" { + ocd_entry_to_doc(&output_value) + } else { + output_value + }; // Deep prune empty/null values before serialization let pruned_value = deep_prune_json(output_value); - + // Serialize as compact JSON (single line) match serde_json::to_string(&pruned_value) { Ok(json_line) => { @@ -1087,7 +1670,11 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { } } Err(e) => { - eprintln!("Error serializing JSON from {}: {}", path.display(), e); + eprintln!( + "Error serializing JSON from {}: {}", + path.display(), + e + ); } } } @@ -1109,28 +1696,30 @@ async fn run_logs_command(cmd: Command) -> anyhow::Result<()> { Ok(()) } - /// Parse a join string like "bill.title" into (dataset_name, field_path) fn parse_join_string(join_str: &str) -> Option<(String, Vec<String>)> { let parts: Vec<&str> = join_str.split('.').collect(); if parts.is_empty() { return None; } - + let dataset_name = parts[0].to_string(); let field_path = if parts.len() > 1 { parts[1..].iter().map(|s| s.to_string()).collect() } else { Vec::new() }; - + Some((dataset_name, field_path)) } /// Extract a value from JSON using a field path (e.g., ["title"] or ["bill", "title"]) -fn extract_json_field(value: &serde_json::Value, field_path: &[String]) -> Option<serde_json::Value> { +fn extract_json_field( + value: &serde_json::Value, + field_path: &[String], +) -> Option<serde_json::Value> { let mut current = value; - + for field in field_path { match current { serde_json::Value::Object(map) => { @@ -1146,7 +1735,7 @@ fn extract_json_field(value: &serde_json::Value, field_path: &[String]) -> Optio _ => return None, } } - + Some(current.clone()) } @@ -1197,45 +1786,50 @@ fn deep_prune_json(value:
serde_json::Value) -> serde_json::Value { /// Extract timestamp from a path string (after "logs/" and before "_") /// Example: "path/to/logs/20250121T000000Z_filename.json" -> "20250121T000000Z" fn extract_timestamp_from_path(path: &str) -> Option { - // Find the position of "/logs/" - if let Some(logs_pos) = path.find("/logs/") { - // Get the substring after "/logs/" - let after_logs = &path[logs_pos + 6..]; - // Find the position of "_" after "logs/" - if let Some(underscore_pos) = after_logs.find('_') { - // Extract the timestamp (between "logs/" and "_") - let timestamp = &after_logs[..underscore_pos]; - if !timestamp.is_empty() { - return Some(timestamp.to_string()); - } - } + // OCD-files log filenames take two shapes: action-named entries use + // `_.json` (e.g. `20250129T022703Z_bill_number_assigned.json`) + // and OCD-classification entries use `.classification.<...>.json` + // (e.g. `20250131T030931Z.classification.introduction.lower.json`). + // The action-based filter (`--filter default`) drops the latter, so the + // `_`-only extractor used to be sufficient; once `--filter none` became + // the `govbot logs` default for Frankie back-compat, the `.`-separated + // entries flow through and need their timestamp projected too. + let logs_pos = path.find("/logs/")?; + let after_logs = &path[logs_pos + 6..]; + let separator_pos = after_logs.find(|c: char| c == '_' || c == '.')?; + let timestamp = &after_logs[..separator_pos]; + if timestamp.is_empty() { + None + } else { + Some(timestamp.to_string()) } - None } -/// Compute relative path from git_dir to a file, following symlinks +/// Compute the relative path from `git_dir` to a walked file. +/// +/// Files are walked as `git_dir//...` — including through a `` +/// symlink into the shared dataset cache — so the direct (non-canonicalized) +/// diff is what keeps `sources.log` project-relative. 
Canonicalizing here +/// would resolve a cached dataset to `~/.govbot/cache/...` and escape +/// `git_dir`; it is used only as a last-resort fallback. fn compute_relative_source_path(file_path: &PathBuf, git_dir: &PathBuf) -> String { - // Canonicalize the file path to follow symlinks - let canonical_file = match file_path.canonicalize() { - Ok(p) => p, - Err(_) => file_path.clone(), - }; - - // Canonicalize git_dir for proper relative path calculation - let canonical_git_dir = match git_dir.canonicalize() { - Ok(p) => p, - Err(_) => git_dir.clone(), - }; - - // Get relative path from git_dir to the file + // Preferred: the path as walked, relative to git_dir. + if let Some(rel) = pathdiff::diff_paths(file_path, git_dir) { + if !rel.starts_with("..") { + return rel.to_string_lossy().replace('\\', "/"); + } + } + + // Fallback: canonicalize both ends and diff. + let canonical_file = file_path + .canonicalize() + .unwrap_or_else(|_| file_path.clone()); + let canonical_git_dir = git_dir.canonicalize().unwrap_or_else(|_| git_dir.clone()); match pathdiff::diff_paths(&canonical_file, &canonical_git_dir) { Some(rel_path) => rel_path.to_string_lossy().replace('\\', "/"), - None => { - // Fallback: use path relative to git_dir directly - pathdiff::diff_paths(file_path, git_dir) - .map(|p| p.to_string_lossy().replace('\\', "/")) - .unwrap_or_else(|| file_path.to_string_lossy().replace('\\', "/")) - } + None => pathdiff::diff_paths(file_path, git_dir) + .map(|p| p.to_string_lossy().replace('\\', "/")) + .unwrap_or_else(|| file_path.to_string_lossy().replace('\\', "/")), } } @@ -1245,7 +1839,8 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { govbot_dir, memory_limit, threads, - } = cmd else { + } = cmd + else { unreachable!() }; @@ -1253,23 +1848,25 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { // Check if directory exists if !repos_dir.exists() { - eprintln!("Error: Govbot repos directory not found: {}", repos_dir.display()); - 
eprintln!("Run 'govbot clone all' first to clone repositories."); + eprintln!( + "Error: Govbot repos directory not found: {}", + repos_dir.display() + ); + eprintln!("Run 'govbot pull all' first to pull datasets."); return Ok(()); } // Get base govbot directory (parent of repos) // e.g., if repos_dir is ./.govbot/repos, base_dir is ./.govbot - let base_govbot_dir = repos_dir.parent() + let base_govbot_dir = repos_dir + .parent() .ok_or_else(|| anyhow::anyhow!("Could not determine base govbot directory"))?; - + // Ensure base directory exists std::fs::create_dir_all(base_govbot_dir)?; // Check if duckdb is available - let duckdb_check = ProcessCommand::new("duckdb") - .arg("--version") - .output(); + let duckdb_check = ProcessCommand::new("duckdb").arg("--version").output(); if duckdb_check.is_err() { eprintln!("Error: 'duckdb' command not found."); @@ -1279,7 +1876,8 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { // Database file goes in the base govbot directory // Resolve to absolute path to ensure it's created in the right location - let db_path = base_govbot_dir.canonicalize() + let db_path = base_govbot_dir + .canonicalize() .unwrap_or_else(|_| base_govbot_dir.to_path_buf()) .join(&database); let db_path_str = db_path.to_string_lossy().to_string(); @@ -1322,7 +1920,10 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { sql_script.push_str("SELECT \n"); sql_script.push_str(" *,\n"); sql_script.push_str(" filename as source_file\n"); - sql_script.push_str(&format!("FROM read_json_auto('{}/**/bills/*/metadata.json', \n", repos_dir_str)); + sql_script.push_str(&format!( + "FROM read_json_auto('{}/**/bills/*/metadata.json', \n", + repos_dir_str + )); sql_script.push_str(" filename=true, \n"); sql_script.push_str(" union_by_name=true);\n"); sql_script.push_str("\n"); @@ -1354,7 +1955,7 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { duckdb_cmd.stderr(std::process::Stdio::piped()); let mut child = 
duckdb_cmd.spawn()?; - + // Write SQL to stdin if let Some(mut stdin) = child.stdin.take() { stdin.write_all(sql_script.as_bytes())?; @@ -1393,735 +1994,563 @@ async fn run_load_command(cmd: Command) -> anyhow::Result<()> { fn extract_path_info(path: &str) -> Option<(String, String, String)> { // Find country: pattern let country_start = path.find("country:")?; - let country_end = path[country_start + 8..].find('/').unwrap_or(path.len() - country_start - 8); + let country_end = path[country_start + 8..] + .find('/') + .unwrap_or(path.len() - country_start - 8); let country = path[country_start + 8..country_start + 8 + country_end].to_string(); - + // Find state: pattern let state_start = path.find("/state:")?; - let state_end = path[state_start + 7..].find('/').unwrap_or(path.len() - state_start - 7); + let state_end = path[state_start + 7..] + .find('/') + .unwrap_or(path.len() - state_start - 7); let state = path[state_start + 7..state_start + 7 + state_end].to_string(); - + // Find sessions/ pattern let sessions_start = path.find("/sessions/")?; - let session_end = path[sessions_start + 10..].find('/').unwrap_or(path.len() - sessions_start - 10); + let session_end = path[sessions_start + 10..] 
+ .find('/') + .unwrap_or(path.len() - sessions_start - 10); let session_id = path[sessions_start + 10..sessions_start + 10 + session_end].to_string(); - + Some((country, state, session_id)) } -/// Download a file from a URL to a local path -fn download_file(url: &str, path: &std::path::Path) -> anyhow::Result<()> { - eprintln!("Downloading {}...", url); - let response = reqwest::blocking::get(url)?; - if !response.status().is_success() { - return Err(anyhow::anyhow!("Failed to download {}: HTTP {}", url, response.status())); - } - let mut file = std::fs::File::create(path)?; - std::io::copy(&mut response.bytes()?.as_ref(), &mut file)?; - Ok(()) +/// The session directory of a log file path — the ancestor whose immediate +/// child is `bills/` — together with the path segments that uniquely place it +/// inside its dataset. +/// +/// Why pulled out: `resolve_tags_dir` needs the path twice, once to look at +/// the project-rooted `tags//...` layout and once for the in-cache +/// `/tags/` fallback. Computing it in one place keeps both lookups +/// in sync with the canonical dataset layout. +struct SessionAnchor { + /// The session directory itself (the `bills/`-bearing ancestor). + session_dir: PathBuf, + /// The dataset's `short_name` — the first path segment under the repos + /// dir (e.g. `wy-legislation`). `None` if the path is not inside a + /// recognisable `//country:.../sessions/...` layout, in + /// which case the project-rooted lookup is skipped. + dataset: Option, + /// The `country:` segment as-is (e.g. `country:us`). + country_segment: String, + /// The `state:` segment as-is (e.g. `state:wy`). + state_segment: String, + /// The session id (the segment after `sessions/`). + session_id: String, } -/// Ensure embedding model and tokenizer exist; if missing, download them from Hugging Face. -/// Returns true if files are present/ready, false otherwise. 
-fn ensure_embedding_files(model_dir: &std::path::Path) -> bool { - let model_path = model_dir.join("model.onnx"); - let tokenizer_path = model_dir.join("tokenizer.json"); - let _vocab_path = model_dir.join("vocab.txt"); - - if model_path.exists() && tokenizer_path.exists() { - return true; +/// Walk up from `log_path` to its session directory (the `bills/`-bearing +/// ancestor) and capture every segment needed to plant a tag file under +/// `/tags//country:.../state:.../sessions//`. Returns +/// `None` when the path is not inside the canonical dataset layout. +fn parse_session_anchor(log_path: &Path) -> Option<SessionAnchor> { + let mut cursor = log_path.parent(); + while let Some(dir) = cursor { + if dir.join("bills").is_dir() { + // Found the session dir. Walk *down* its components to recover + // the dataset short_name and jurisdiction segments — they are + // the same segments `parse_doc_route` extracts on the writer + // side, so the two halves stay symmetric. + let mut country_segment: Option<String> = None; + let mut state_segment: Option<String> = None; + let mut session_id: Option<String> = None; + let mut dataset: Option<String> = None; + let mut prev_was_sessions = false; + let mut country_seen = false; + for component in dir.components() { + let seg = component.as_os_str().to_string_lossy().to_string(); + if seg.starts_with("country:") { + country_segment = Some(seg.clone()); + country_seen = true; + } else if seg.starts_with("state:") { + state_segment = Some(seg.clone()); + } else if seg == "sessions" { + prev_was_sessions = true; + continue; + } else if prev_was_sessions { + session_id = Some(seg.clone()); + } + // The dataset short_name is the path segment immediately + // before the first `country:` segment. For typical layouts + // (`//country:.../...`) that is one segment; + // we only need the most recent non-pathy segment before + // `country:` was first seen.
+ if !country_seen + && !seg.is_empty() + && seg != "/" + && !seg.starts_with("country:") + && !seg.starts_with("state:") + && seg != "sessions" + && seg != "bills" + { + dataset = Some(seg); + } + prev_was_sessions = false; + } + return Some(SessionAnchor { + session_dir: dir.to_path_buf(), + dataset, + country_segment: country_segment?, + state_segment: state_segment?, + session_id: session_id?, + }); + } + cursor = dir.parent(); } + None +} - eprintln!("Embedding files not found. Downloading all-MiniLM-L6-v2 (ONNX) to {}...", model_dir.display()); - - // Use Xenova ONNX exports - let onnx_url = "https://huggingface.co/Xenova/all-MiniLM-L6-v2/resolve/main/onnx/model.onnx"; - let tokenizer_url = "https://huggingface.co/Xenova/all-MiniLM-L6-v2/resolve/main/tokenizer.json"; - - // Download tokenizer.json - if !tokenizer_path.exists() { - if let Err(e) = download_file(tokenizer_url, &tokenizer_path) { - eprintln!("Failed to download tokenizer.json: {}", e); - return false; +/// Resolve every `tags/`-equivalent directory we are willing to read a tag +/// file from, in the order the caller should consult them. +/// +/// `.govbot/` is the tool's cache (the `node_modules/` equivalent) — tag +/// files belong outside it, in a project-rooted classification-output dir. +/// The primary lookup is therefore `/tags//country:.../ +/// state:.../sessions//`. Two fallbacks stay live for migration: +/// +/// 1. **Primary**: `/tags//country:.../sessions//` +/// — where `govbot apply` writes today. +/// 2. **Fallback A** (Bug 6 / `6cbb12e`): the in-cache +/// `/tags/` sibling-of-`bills/` — kept read-only so a +/// working tree mid-migration still resolves. +/// 3. **Fallback B** (pre-Bug-1): the cwd-rooted +/// `/country:.../state:.../sessions//tags/` — kept for layouts +/// that pre-date the dataset-rooted move (and for explicit +/// `--output-dir` overrides that landed there). +/// +/// The chain is read-only — `apply` itself never touches anything but the +/// primary location. 
+fn resolve_tags_dir_candidates(log_path: &Path, project_dir: &Path) -> Vec<PathBuf> { + let mut candidates: Vec<PathBuf> = Vec::new(); + if let Some(anchor) = parse_session_anchor(log_path) { + // Primary: /tags//country:.../state:.../sessions// + if let Some(ref dataset) = anchor.dataset { + candidates.push( + project_dir + .join("tags") + .join(dataset) + .join(&anchor.country_segment) + .join(&anchor.state_segment) + .join("sessions") + .join(&anchor.session_id), + ); } + // Fallback A: in-cache session/tags/ (Bug 6 layout, read-only). + candidates.push(anchor.session_dir.join("tags")); } + candidates +} - // Download ONNX model - if !model_path.exists() { - if let Err(e) = download_file(onnx_url, &model_path) { - eprintln!("Failed to download ONNX model: {}", e); - return false; +/// Read every `*.json` / `*.tag.json` file in `tags_dir`, parse each as a +/// `TagFile`, and return the subset whose `bills` map contains `bill_id`, +/// keyed by tag name (file stem with any `.tag` suffix stripped). Returns an +/// empty map if `tags_dir` does not exist or contains no matching tags. +/// +/// Pulled out so the same logic serves the dataset-rooted lookup *and* the +/// project-root fallback below without duplication. +fn match_tags_in_dir(tags_dir: &Path, bill_id: &str) -> serde_json::Map<String, serde_json::Value> { + let mut matched = serde_json::Map::new(); + if !tags_dir.is_dir() { + return matched; + } + let entries = match fs::read_dir(tags_dir) { + Ok(e) => e, + Err(_) => return matched, + }; + for entry in entries.flatten() { + let path = entry.path(); + if path.extension().and_then(|s| s.to_str()) != Some("json") { + continue; + } + let stem = match path.file_stem().and_then(|s| s.to_str()) { + Some(s) => s, + None => continue, + }; + // `budget.tag.json` -> `budget`; plain `budget.json` -> `budget`.
+ let tag_name = stem.strip_suffix(".tag").unwrap_or(stem); + let contents = match fs::read_to_string(&path) { + Ok(c) => c, + Err(_) => continue, + }; + let tag_file: govbot::TagFile = match serde_json::from_str(&contents) { + Ok(t) => t, + Err(_) => continue, + }; + if let Some(bill_result) = tag_file.bills.get(bill_id) { + matched.insert( + tag_name.to_string(), + serde_json::to_value(&bill_result.score).unwrap_or(serde_json::Value::Null), + ); } } + matched +} - if !model_path.exists() || !tokenizer_path.exists() { - eprintln!( - "Download completed but model.onnx or tokenizer.json not found in {}", - model_dir.display() - ); - return false; - } +/// The slice of a `fastclass classify` result that `govbot apply` consumes. +/// Unknown fields are ignored, so fastclass may evolve its output freely. +#[derive(serde::Deserialize)] +struct FastclassResult { + doc: String, + #[serde(default)] + text_hash: String, + #[serde(default)] + tags: HashMap<String, FastclassTag>, +} - eprintln!("✅ Successfully downloaded embedding files!"); - true +#[derive(serde::Deserialize)] +struct FastclassTag { + #[serde(default)] + matched: bool, + #[serde(default)] + fusion: FastclassFusion, } -/// Tag result structure: (tag_key, score_breakdown) -type TagResult = (String, govbot::ScoreBreakdown); - -/// Check if a bill is already tagged in tag file(s) for the given session -/// If tag_name is Some, only checks that specific tag file -/// Returns a list of tag names that contain this bill -fn check_existing_tags( - tags_dir: &PathBuf, - bill_id: &str, - tag_name: Option<&str>, -) -> anyhow::Result<Vec<String>> { - let mut matched_tags = Vec::new(); - - if !tags_dir.exists() { - return Ok(matched_tags); - } - - // If a specific tag is requested, only check that tag file - if let Some(requested_tag) = tag_name { - let tag_path = tags_dir.join(format!("{}.tag.json", requested_tag)); - if tag_path.exists() { - match fs::read_to_string(&tag_path) { - Ok(contents) => { - if let Ok(tag_file) =
serde_json::from_str::(&contents) { - if tag_file.bills.contains_key(bill_id) { - matched_tags.push(requested_tag.to_string()); - } - } - } - Err(_) => { - // Tag file exists but can't be read - return empty - } - } +#[derive(serde::Deserialize, Default)] +struct FastclassFusion { + #[serde(default)] + final_score: f64, +} + +/// A bill's location in the dataset, parsed from a fastclass result's `doc` +/// id — which `govbot source --select docs` set to the bill's directory path. +struct BillRoute { + /// The dataset's `short_name` — the path segment before `country:` in + /// the doc id (e.g. `wy-legislation`). `None` if the doc id has no + /// recognisable prefix. + dataset: Option, + country: String, + state: String, + session: String, + bill_id: String, +} + +/// Parse a `doc` id of the form +/// `/country:/state:/sessions//bills/` into the +/// pieces needed to place its `.tag.json` file. Returns `None` for any id that +/// is not a dataset bill path (e.g. a document from a non-govbot source). +/// +/// The leading `` segment is the dataset's `short_name` (e.g. +/// `wy-legislation`); it is what lets `govbot apply` route each tag file under +/// `/tags//...` by default — the dataset prefix is what +/// disambiguates same-named tag files across jurisdictions in a multi-dataset +/// project. 
+fn parse_doc_route(doc: &str) -> Option<BillRoute> { + let segments: Vec<&str> = doc.split('/').collect(); + let (mut country, mut state, mut session, mut bill_id) = (None, None, None, None); + let mut country_idx: Option<usize> = None; + for (i, seg) in segments.iter().enumerate() { + if let Some(c) = seg.strip_prefix("country:") { + country = Some(c.to_string()); + if country_idx.is_none() { + country_idx = Some(i); + } + } else if let Some(s) = seg.strip_prefix("state:") { + state = Some(s.to_string()); + } else if *seg == "sessions" { + session = segments.get(i + 1).map(|s| s.to_string()); + } else if *seg == "bills" { + bill_id = segments.get(i + 1).map(|s| s.to_string()); } - return Ok(matched_tags); } - - // Otherwise, scan all .tag.json files in the tags directory - for entry in fs::read_dir(tags_dir)? { - let entry = entry?; - let path = entry.path(); - - if let Some(ext) = path.extension() { - if ext == "json" { - if let Some(stem) = path.file_stem().and_then(|s| s.to_str()) { - // Remove .tag suffix if present (e.g., "budget.tag" -> "budget") - let tag_name = stem.strip_suffix(".tag").unwrap_or(stem); - - match fs::read_to_string(&path) { - Ok(contents) => { - if let Ok(tag_file) = serde_json::from_str::(&contents) { - // Check if bill_id exists in bills map - if tag_file.bills.contains_key(bill_id) { - matched_tags.push(tag_name.to_string()); - } - } - } - Err(_) => { - // Skip files that can't be read - continue; - } - } - } - } + // Anything sitting in front of `country:` is the dataset short_name. + // For today's `/country:/...` shape that is exactly one + // segment, but tolerate nested prefixes by joining everything before the + // `country:` segment (skipping empties from a leading `/`).
+ let dataset = country_idx.and_then(|i| { + let prefix: Vec<&str> = segments[..i] + .iter() + .copied() + .filter(|s| !s.is_empty()) + .collect(); + if prefix.is_empty() { + None + } else { + Some(prefix.join("/")) } + }); + Some(BillRoute { + dataset, + country: country?, + state: state?, + session: session?, + bill_id: bill_id?, + }) +} + +/// Build a fresh `TagFile` for `tag_key`. The taxonomy now lives in a fastclass +/// classifier bundle, not in `govbot.yml`, so `tag_defs` is normally empty and +/// each tag file gets a minimal stub `tag_config` derived from the tag name. +fn new_tag_file(tag_key: &str, tag_defs: &[govbot::TagDefinition], now: &str) -> TagFile { + let tag_def = tag_defs + .iter() + .find(|td| td.name == tag_key) + .cloned() + .unwrap_or_else(|| govbot::TagDefinition { + name: tag_key.to_string(), + description: String::new(), + examples: Vec::new(), + include_keywords: Vec::new(), + exclude_keywords: Vec::new(), + negative_examples: Vec::new(), + threshold: 0.5, + }); + let tag_config_hash = hash_text(&serde_json::to_string(&tag_def).unwrap_or_default()); + TagFile { + metadata: TagFileMetadata { + last_run: now.to_string(), + model: "fastclass".to_string(), + tag_config_hash, + }, + tag_config: tag_def, + text_cache: HashMap::new(), + bills: HashMap::new(), } - - Ok(matched_tags) } -async fn run_tag_command(cmd: Command) -> anyhow::Result<()> { - let Command::Tag { +/// `govbot apply` — the persistence sink of the tagging pipeline. +/// +/// It classifies nothing. It reads `fastclass classify` result JSON from +/// stdin — the apply sink of +/// `govbot source --select docs | fastclass classify - | govbot apply` — and +/// for every matched tag writes the bill into the per-tag `.tag.json` file +/// under `/tags//country:.../sessions//`. Those are the +/// files `govbot publish` later turns into feeds. 
+/// +/// **Why `tags/` and not `.govbot/`:** `.govbot/` is the tool's cache — the +/// equivalent of `node_modules/` — and must stay user-edit-free so a fresh +/// `rm -rf .govbot/` never destroys the bot's classification work. Tag files +/// are derived classification *outputs*, not cache contents; they live in +/// their own dedicated, project-rooted directory peer to `dist/`. +async fn run_apply_command(cmd: Command) -> anyhow::Result<()> { + let Command::Apply { tag_name, output_dir, - govbot_dir, overwrite, - } = cmd else { + } = cmd + else { unreachable!() }; - // Check if govbot.yml exists in current directory let current_dir = std::env::current_dir()?; - let default_tags_cfg = current_dir.join("govbot.yml"); - - // Model/tokenizer directory: prefer user-specified govbot-dir or env GOVBOT_DIR, else default .govbot - let model_dir: PathBuf = if let Some(ref dir) = govbot_dir { - PathBuf::from(dir) - } else if let Ok(dir) = std::env::var("GOVBOT_DIR") { - PathBuf::from(dir) - } else { - current_dir.join(".govbot") - }; - fs::create_dir_all(&model_dir)?; - let model_path = model_dir.join("model.onnx"); - let tokenizer_path = model_dir.join("tokenizer.json"); - - // Require govbot.yml - if !default_tags_cfg.exists() { - return Err(anyhow::anyhow!( - "govbot.yml not found in current directory" - )); - } - - // Load tag definitions (needed for both embedding and keyword fallback) - let tag_defs = govbot::embeddings::load_tags_config(&default_tags_cfg) - .map_err(|e| anyhow::anyhow!("Failed to parse govbot.yml: {}", e))?; + // Tag files land under --output-dir when given. When unset, each tag file + // is routed under the project's classification-output directory + // `/tags//country:.../sessions/.../.tag.json` + // — the dataset short_name comes from the first segment of the fastclass + // result's `doc` field, mirroring where the bill's `metadata.json` came + // from. 
The explicit `--output-dir` override stays a verbatim root (the + // dataset prefix is dropped), which is the back-compat escape hatch for + // callers that want to write into a custom layout. + let explicit_output_dir = output_dir.as_ref().map(PathBuf::from); + let default_tags_root = current_dir.join("tags"); - // Try embedding mode first - let embedding_matcher = if ensure_embedding_files(&model_dir) { - let tags_path = default_tags_cfg.clone(); + // The taxonomy now lives in a fastclass classifier bundle, not in + // govbot.yml — each `.tag.json` is stamped with a stub `tag_config` + // derived only from the matched tag name. + let tag_defs: Vec<govbot::TagDefinition> = Vec::new(); - eprintln!("Using embedding mode:"); - eprintln!(" Model: {}", model_path.display()); - eprintln!(" Tokenizer: {}", tokenizer_path.display()); - eprintln!(" Tags config: {}", tags_path.display()); - - match TagMatcher::from_files(&model_path, &tokenizer_path, &tags_path) { - Ok(matcher) => Some(matcher), - Err(e) => { - eprintln!("Warning: Failed to initialize embedding matcher: {}", e); - eprintln!("Falling back to keyword-based matching."); - None - } - } - } else { - eprintln!("Embedding files not available; using keyword-based matching."); - eprintln!(" Tags config: {}", default_tags_cfg.display()); - None - }; - - // Determine output directory - // If govbot.yml exists, use its directory as the base output directory - let base_output_dir = if default_tags_cfg.exists() { - // Use the directory containing govbot.yml - default_tags_cfg.parent() - .unwrap_or(&current_dir) - .to_path_buf() - } else if let Some(ref dir) = output_dir { - PathBuf::from(dir) - } else if let Some(ref dir) = govbot_dir { - PathBuf::from(dir) - } else if let Ok(dir) = std::env::var("GOVBOT_DIR") { - PathBuf::from(dir) - } else { - // Default to current directory - current_dir - }; - - // Read JSON lines from stdin let stdin = io::stdin(); let reader = BufReader::new(stdin.lock()); - - let mut processed_count = 0; - let mut
skipped_count = 0; - let mut read_count: usize = 0; - - eprintln!("Reading JSON lines from stdin..."); - + let now = chrono::Utc::now().to_rfc3339(); + let mut written = 0usize; + let mut skipped = 0usize; + + eprintln!("Reading fastclass classification results from stdin..."); + // Track per-dataset write counts so the final summary reflects where the + // tag files actually landed. + let mut written_dirs: std::collections::BTreeSet = Default::default(); for line_result in reader.lines() { let line = line_result?; let line = line.trim(); if line.is_empty() { - read_count += 1; - if read_count % 100 == 0 { - eprintln!("Read {} lines (processed {}, skipped {})...", read_count, processed_count, skipped_count); - } continue; } - - read_count += 1; - // Parse JSON line (assumes default selector format) - match serde_json::from_str::(line) { - Ok(json_value) => { - // Extract bill_id from top-level "id" field (default selector format) - let bill_id_opt = json_value - .get("id") - .and_then(|id| id.as_str()); - - // Extract text from JSON for embedding comparison - let bill_text = ocd_files_select_default(&json_value); - - // Extract path info from sources.log (default selector format) - let path_info = json_value - .get("sources") - .and_then(|sources| sources.get("log")) - .and_then(|path| path.as_str()) - .and_then(|log_path| extract_path_info(log_path)) - .or_else(|| { - // Fallback: use default values if we can't determine - Some(("us".to_string(), "unknown".to_string(), "unknown".to_string())) - }); - - // Process if we have path info (from sources.log in default selector format) - if let Some((country, state, session_id)) = path_info { - // Get bill_id - use "id" from default selector, or generate from text hash if missing - let bill_id = bill_id_opt.map(|s| s.to_string()).unwrap_or_else(|| { - let text_hash = hash_text(&bill_text); - format!("entry_{}", &text_hash[..8]) - }); - - // Determine tags directory - let tags_dir = base_output_dir - 
.join(&format!("country:{}", country)) - .join(&format!("state:{}", state)) - .join("sessions") - .join(&session_id) - .join("tags"); - - // Validate tag_name if provided - if let Some(ref requested_tag) = tag_name { - if !tag_defs.iter().any(|td| td.name == *requested_tag) { - return Err(anyhow::anyhow!( - "Tag '{}' not found in govbot.yml. Available tags: {}", - requested_tag, - tag_defs.iter().map(|td| td.name.clone()).collect::>().join(", ") - )); - } - } - - // Fast path: check if bill is already tagged (unless overwrite is set) - let mut matched_tags: Vec = Vec::new(); - let mut should_run_tagging = overwrite; - - if !overwrite { - match check_existing_tags(&tags_dir, &bill_id, tag_name.as_deref()) { - Ok(existing_tags) => { - if !existing_tags.is_empty() { - // Bill is already tagged - output the line and skip tagging - matched_tags = existing_tags; - should_run_tagging = false; - } else { - // Bill not found in tag file(s) - need to run tagging - should_run_tagging = true; - } - } - Err(e) => { - // Error checking tags - run tagging to be safe - eprintln!("Warning: Error checking existing tags for {}: {}", bill_id, e); - should_run_tagging = true; - } - } - } - - // Run tagging logic if needed - if should_run_tagging { - // Choose strategy based on mode - let mut tags: Vec = if let Some(matcher) = embedding_matcher.as_ref() { - match matcher.match_json_value(&json_value) { - Ok(results) => results, - Err(e) => { - eprintln!("Error running embedding matcher for bill {}: {}", bill_id, e); - eprintln!("Falling back to keyword-based matching for this entry."); - // Fall back to keyword matching for this entry - govbot::embeddings::match_tags_keywords(&tag_defs, &json_value) - } - } - } else { - // Use keyword-based fallback matcher - govbot::embeddings::match_tags_keywords(&tag_defs, &json_value) - }; - - // Filter to specific tag if requested - if let Some(ref requested_tag) = tag_name { - tags.retain(|(tag, _)| tag == requested_tag); - } - - // Extract tag 
names from results - matched_tags = tags.iter().map(|(tag_name, _)| tag_name.clone()).collect(); - - // Save tags to files if we found matches - if !tags.is_empty() { - let text_hash = hash_text(&bill_text); - - // Write per-tag files immediately - fs::create_dir_all(&tags_dir)?; - - // Get current timestamp for metadata - let now = chrono::Utc::now().to_rfc3339(); - let model_path_str = if embedding_matcher.is_some() { - model_path.to_string_lossy().to_string() - } else { - "keyword-fallback".to_string() - }; - - for (tag_key, score_breakdown) in tags { - let tag_path = tags_dir.join(format!("{}.tag.json", tag_key)); - - // Load or create TagFile structure - let mut tag_file: TagFile = if tag_path.exists() { - match fs::read_to_string(&tag_path) { - Ok(contents) => { - serde_json::from_str(&contents).unwrap_or_else(|_| { - // If parsing fails, create a new TagFile - let tag_def = tag_defs - .iter() - .find(|td| td.name == tag_key) - .cloned() - .unwrap_or_else(|| govbot::TagDefinition { - name: tag_key.clone(), - description: String::new(), - examples: Vec::new(), - include_keywords: Vec::new(), - exclude_keywords: Vec::new(), - negative_examples: Vec::new(), - threshold: 0.5, - }); - - let tag_config_hash = hash_text(&serde_json::to_string(&tag_def).unwrap_or_default()); - - TagFile { - metadata: TagFileMetadata { - last_run: now.clone(), - model: model_path_str.clone(), - tag_config_hash, - }, - tag_config: tag_def, - text_cache: HashMap::new(), - bills: HashMap::new(), - } - }) - } - Err(_) => { - // Create new TagFile - let tag_def = tag_defs - .iter() - .find(|td| td.name == tag_key) - .cloned() - .unwrap_or_else(|| govbot::TagDefinition { - name: tag_key.clone(), - description: String::new(), - examples: Vec::new(), - include_keywords: Vec::new(), - exclude_keywords: Vec::new(), - negative_examples: Vec::new(), - threshold: 0.5, - }); - - let tag_config_hash = hash_text(&serde_json::to_string(&tag_def)?); - - TagFile { - metadata: TagFileMetadata { - 
last_run: now.clone(), - model: model_path_str.clone(), - tag_config_hash, - }, - tag_config: tag_def, - text_cache: HashMap::new(), - bills: HashMap::new(), - } - } - } - } else { - // Create new TagFile - let tag_def = tag_defs - .iter() - .find(|td| td.name == tag_key) - .cloned() - .unwrap_or_else(|| govbot::TagDefinition { - name: tag_key.clone(), - description: String::new(), - examples: Vec::new(), - include_keywords: Vec::new(), - exclude_keywords: Vec::new(), - negative_examples: Vec::new(), - threshold: 0.5, - }); - - let tag_config_hash = hash_text(&serde_json::to_string(&tag_def)?); - - TagFile { - metadata: TagFileMetadata { - last_run: now.clone(), - model: model_path_str.clone(), - tag_config_hash, - }, - tag_config: tag_def, - text_cache: HashMap::new(), - bills: HashMap::new(), - } - }; + let result: FastclassResult = match serde_json::from_str(line) { + Ok(r) => r, + Err(e) => { + eprintln!("Warning: skipping unparseable result line: {}", e); + skipped += 1; + continue; + } + }; + let Some(route) = parse_doc_route(&result.doc) else { + eprintln!( + "Warning: skipping '{}' — its id is not a dataset bill path. 
\ + Stream documents in with `govbot source --select docs`.", + result.doc + ); + skipped += 1; + continue; + }; - // Update metadata - tag_file.metadata.last_run = now.clone(); - tag_file.metadata.model = model_path_str.clone(); - - // Update tag config if it changed - let current_tag_def = tag_defs - .iter() - .find(|td| td.name == tag_key) - .cloned() - .unwrap_or_else(|| tag_file.tag_config.clone()); - - let current_config_hash = hash_text(&serde_json::to_string(&current_tag_def)?); - if current_config_hash != tag_file.metadata.tag_config_hash { - tag_file.tag_config = current_tag_def; - tag_file.metadata.tag_config_hash = current_config_hash; - } - - // Add text to cache if not present - if !tag_file.text_cache.contains_key(&text_hash) { - tag_file.text_cache.insert(text_hash.clone(), bill_text.clone()); - } - - // Add/update bill result - tag_file.bills.insert(bill_id.to_string(), BillTagResult { - text_hash: text_hash.clone(), - score: score_breakdown, - }); - - // Write updated TagFile - let json_string = serde_json::to_string_pretty(&tag_file)?; - fs::write(&tag_path, json_string)?; - } - } - } - - // Output the line if it matches tags (filter mode) - // If a specific tag was requested, only output if that tag matches - // Otherwise, output if any tag matches - let should_output = if let Some(ref requested_tag) = tag_name { - matched_tags.contains(requested_tag) - } else { - !matched_tags.is_empty() - }; - - if should_output { - write_json_line(line)?; - } - - processed_count += 1; - if processed_count % 50 == 0 { - eprintln!("Processed {} entries (matched: {} tags)...", processed_count, matched_tags.len()); - } - } else { - // No path info - skip this entry (default selector should always provide sources.log) - skipped_count += 1; - } + // The tags this bill matched, optionally narrowed to one requested tag.
+ let mut matched: Vec<(String, f64)> = Vec::new(); + for (name, tag) in &result.tags { + if !tag.matched { + continue; } - Err(_e) => { - // Skip malformed/empty lines quietly - skipped_count += 1; + if let Some(req) = &tag_name { + if req != name { + continue; + } } + matched.push((name.clone(), tag.fusion.final_score)); + } + if matched.is_empty() { + continue; } - if read_count % 100 == 0 { - eprintln!("Read {} lines (processed {}, skipped {})...", read_count, processed_count, skipped_count); + // Resolve where this bill's tag files land. With an explicit + // `--output-dir`, that path is the root and the dataset short_name is + // dropped (back-compat escape hatch). With no override, route the file + // under the project's `tags//...` output dir so the dataset + // prefix disambiguates same-named tags across jurisdictions. If the + // `doc` id lacks a recognisable dataset prefix (a non-govbot source), + // fall back to a no-prefix `tags/` so the record is still persisted — + // never write into `.govbot/`, which is the tool's cache. + let base_output_dir = match (&explicit_output_dir, &route.dataset) { + (Some(root), _) => root.clone(), + (None, Some(dataset)) => default_tags_root.join(dataset), + (None, None) => default_tags_root.clone(), + }; + // Inside the dataset prefix, mirror the source's jurisdiction path + // exactly — no trailing `/tags/` segment, because the project-level + // `tags/` directory already names the kind. The shape on disk is + // `//country:.../state:.../sessions//.tag.json`. + let tags_dir = base_output_dir + .join(format!("country:{}", route.country)) + .join(format!("state:{}", route.state)) + .join("sessions") + .join(&route.session); + fs::create_dir_all(&tags_dir)?; + written_dirs.insert(base_output_dir.clone()); + + for (tag_key, final_score) in matched { + let tag_path = tags_dir.join(format!("{}.tag.json", tag_key)); + + // Update the existing tag file, or start a fresh one. 
+ let mut tag_file: TagFile = fs::read_to_string(&tag_path) + .ok() + .and_then(|c| serde_json::from_str(&c).ok()) + .unwrap_or_else(|| new_tag_file(&tag_key, &tag_defs, &now)); + + // With --overwrite off, an already-tagged bill is left untouched. + if !overwrite && tag_file.bills.contains_key(&route.bill_id) { + continue; + } + + tag_file.metadata.last_run = now.clone(); + tag_file.metadata.model = "fastclass".to_string(); + tag_file.bills.insert( + route.bill_id.clone(), + BillTagResult { + text_hash: result.text_hash.clone(), + score: govbot::ScoreBreakdown { + final_score, + base_embedding: None, + example_similarity: None, + keyword_match: Vec::new(), + negative_penalty: 0.0, + }, + }, + ); + fs::write(&tag_path, serde_json::to_string_pretty(&tag_file)?)?; } + written += 1; } - - eprintln!("\nProcessed: {}, Skipped: {}", processed_count, skipped_count); - eprintln!("\n✅ Tagging complete!"); - + + let dirs_summary = if written_dirs.is_empty() { + explicit_output_dir + .as_ref() + .map(|d| d.display().to_string()) + .unwrap_or_else(|| default_tags_root.display().to_string()) + } else { + written_dirs + .iter() + .map(|p| p.display().to_string()) + .collect::>() + .join(", ") + }; + eprintln!( + "\n✅ Persisted {} tagged bill(s) under {}; skipped {} entr(ies).", + written, dirs_summary, skipped + ); Ok(()) } -async fn run_build_command(cmd: Command) -> anyhow::Result<()> { - let Command::Build { - tags, +/// `govbot publish` — run the manifest's publishers. +/// +/// Reads `govbot.yml`'s typed `publish:` map, collects the tagged result +/// stream from `govbot source`, and runs each named publisher (`rss`/`html`/ +/// `json`/`duckdb`) against it. The publisher's tag list comes from +/// `publish..select`; the retired `tags:` manifest block is gone. 
+async fn run_publish_command(cmd: Command) -> anyhow::Result<()> { + let Command::Publish { + publishers, limit, output_dir, output_file, + dry_run, govbot_dir, - } = cmd else { + } = cmd + else { unreachable!() }; - - // Check if govbot.yml exists in current directory + let current_dir = std::env::current_dir()?; let config_path = current_dir.join("govbot.yml"); - if !config_path.exists() { return Err(anyhow::anyhow!("govbot.yml not found in current directory")); } - - // Load configuration - let config = load_config(&config_path)?; - - // Get tags configuration - let tags_config = config.get("tags") - .and_then(|t| t.as_object()) - .ok_or_else(|| anyhow::anyhow!("No tags found in configuration"))?; - - // Determine which tags to use - let tags_to_use: Vec = if tags.is_empty() { - // Use tags from build config, or all tags - if let Some(build_tags) = config.get("build") - .and_then(|p| p.get("tags")) - .and_then(|t| t.as_array()) - { - build_tags - .iter() - .filter_map(|v| v.as_str().map(|s| s.to_string())) - .collect() - } else { - tags_config.keys().cloned().collect() - } - } else { - tags - }; - - // Validate tags exist - for tag in &tags_to_use { - if !tags_config.contains_key(tag) { - return Err(anyhow::anyhow!("Tag '{}' not found in configuration", tag)); - } - } - - if tags_to_use.is_empty() { - return Err(anyhow::anyhow!("No valid tags to process")); - } - - // Get build configuration - let build_config = config.get("build").and_then(|p| p.as_object()); - - // Get output directory - let output_dir_path = if let Some(dir) = output_dir { - PathBuf::from(dir) - } else { - let dir_str = build_config - .and_then(|p| p.get("output_dir")) - .and_then(|d| d.as_str()) - .unwrap_or("docs"); - PathBuf::from(dir_str) - }; - - // Get output filename - let output_filename = if let Some(file) = output_file { - file + + // Typed manifest — `publish:` is the publisher map. 
+ let manifest = load_manifest(&config_path)?; + if manifest.publish.is_empty() { + return Err(anyhow::anyhow!( + "govbot.yml has no `publish:` publishers to run" + )); + } + + // Which publishers to run: all of them, or the requested subset. + let names_to_run: Vec = if publishers.is_empty() { + manifest.publish.keys().cloned().collect() } else { - build_config - .and_then(|p| p.get("output_file")) - .and_then(|f| f.as_str()) - .unwrap_or("feed.xml") - .to_string() - }; - - // Get feed metadata - let feed_title = build_config - .and_then(|p| p.get("title")) - .and_then(|t| t.as_str()) - .map(|s| s.to_string()) - .unwrap_or_else(|| { - format!("{} Legislation", tags_to_use.iter() - .map(|t| t.replace('_', " ").split_whitespace() - .map(|w| { - let mut chars = w.chars(); - match chars.next() { - None => String::new(), - Some(f) => f.to_uppercase().collect::() + chars.as_str(), - } - }) - .collect::>() - .join(" ")) - .collect::>() - .join(" & ")) - }); - - let feed_description = build_config - .and_then(|p| p.get("description")) - .and_then(|d| d.as_str()) - .map(|s| s.to_string()) - .unwrap_or_else(|| { - let mut descs = Vec::new(); - for tag_name in &tags_to_use { - if let Some(tag_obj) = tags_config.get(tag_name).and_then(|t| t.as_object()) { - if let Some(desc) = tag_obj.get("description").and_then(|d| d.as_str()) { - let tag_title = tag_name.replace('_', " ").split_whitespace() - .map(|w| { - let mut chars = w.chars(); - match chars.next() { - None => String::new(), - Some(f) => f.to_uppercase().collect::() + chars.as_str(), - } - }) - .collect::>() - .join(" "); - descs.push(format!("{}: {}", tag_title, &desc[..desc.len().min(200)])); - } - } - } - if descs.is_empty() { - "Legislative updates".to_string() - } else { - descs.join(" | ") + for name in &publishers { + if !manifest.publish.contains_key(name) { + return Err(anyhow::anyhow!( + "publisher '{}' not found in govbot.yml `publish:`", + name + )); } - }); - - let feed_link = build_config - .and_then(|p| 
p.get("base_url")) - .and_then(|u| u.as_str()) - .unwrap_or("https://example.com"); - - let base_url = Some(feed_link); - - // Get repos - let repos = get_repos_from_config(&config); - - // Get repos to process - let repos_to_process: Vec = if repos == vec!["all".to_string()] { - Vec::new() // Empty means all repos - } else { - repos - }; - - // Get limit - parse "none" as no limit, otherwise parse as usize - // Default to 100 if not specified - let limit_str_opt = limit.or_else(|| { - build_config - .and_then(|p| p.get("limit")) - .and_then(|l| { - if let Some(s) = l.as_str() { - Some(s.to_string()) - } else if let Some(n) = l.as_u64() { - Some(n.to_string()) - } else { - None - } - }) - }); - - let limit_value: Option = if let Some(limit_str) = limit_str_opt { - if limit_str.to_lowercase() == "none" { - None // No limit - } else { - limit_str.parse().ok() } - } else { - Some(100) // Default to 100 items + publishers }; - - // Run logs command and collect entries - eprintln!("Collecting log entries for tags: {}", tags_to_use.join(", ")); - let mut entries = Vec::new(); - - // Get the base govbot directory (not the repos subdirectory) - // The logs command expects the base directory and will append /repos itself + + // Resolve the base govbot directory for the `source` subprocess. let base_govbot_dir = if let Some(ref gd) = govbot_dir { gd.clone() } else if let Ok(gd) = std::env::var("GOVBOT_DIR") { gd } else { - // Default: $CWD/.govbot std::env::current_dir() .unwrap_or_else(|_| PathBuf::from(".")) .join(".govbot") .to_string_lossy() .to_string() }; - - // Call logs command as subprocess and parse JSON output - // Use current executable (govbot binary) - let exe = std::env::current_exe() - .unwrap_or_else(|_| PathBuf::from("govbot")); - - let mut cmd = ProcessCommand::new(exe); - cmd.arg("logs") + + // Collect the dataset record stream once: `govbot source` over all + // datasets (an empty `--repos` means every dataset). 
+ let datasets_to_process: Vec = if manifest.datasets == vec!["all".to_string()] { + Vec::new() + } else { + manifest.datasets.clone() + }; + + let exe = std::env::current_exe().unwrap_or_else(|_| PathBuf::from("govbot")); + let mut source_cmd = ProcessCommand::new(exe); + source_cmd + .arg("source") .arg("--join") .arg("bill,tags") .arg("--select") @@ -2130,207 +2559,1967 @@ async fn run_build_command(cmd: Command) -> anyhow::Result<()> { .arg("default") .arg("--sort") .arg("DESC"); - - // Only add --govbot-dir if it's not the default if !base_govbot_dir.is_empty() && base_govbot_dir != ".govbot" { - cmd.arg("--govbot-dir").arg(&base_govbot_dir); - } - - if !repos_to_process.is_empty() { - cmd.arg("--repos"); - for repo in &repos_to_process { - cmd.arg(repo); - } - } - - // Don't pass limit to logs command - we'll limit after filtering/sorting - // This ensures we get the best entries, not just the first N from each repo - - let output = cmd.output()?; - - // Check return code + source_cmd.arg("--govbot-dir").arg(&base_govbot_dir); + } + if !datasets_to_process.is_empty() { + source_cmd.arg("--repos"); + for d in &datasets_to_process { + source_cmd.arg(d); + } + } + + let output = source_cmd.output()?; if !output.status.success() { let stderr_str = String::from_utf8_lossy(&output.stderr); - eprintln!("Error: logs command failed with exit code: {:?}", output.status.code()); + eprintln!("Error: source command failed: {:?}", output.status.code()); eprintln!("Stderr: {}", stderr_str); - return Err(anyhow::anyhow!("Failed to collect log entries")); + return Err(anyhow::anyhow!("Failed to collect dataset records")); } - - // Check if there were any errors in stderr (but compilation messages are OK) - if !output.stderr.is_empty() { - let stderr_str = String::from_utf8_lossy(&output.stderr); - // Filter out compilation messages - let filtered_stderr: Vec<&str> = stderr_str - .lines() - .filter(|line| !line.contains("Compiling") && !line.contains("Finished")) - .collect(); 
- if !filtered_stderr.is_empty() { - eprintln!("Warning from logs command: {}", filtered_stderr.join("\n")); - } - } - - // Parse JSON lines from output - let mut total_entries = 0; - let mut filtered_entries = 0; + let stdout_str = String::from_utf8_lossy(&output.stdout); - if stdout_str.trim().is_empty() { - eprintln!("Warning: logs command returned no output. Make sure repositories are cloned and contain log files."); + eprintln!( + "Warning: source returned no output. Make sure datasets are pulled \ + and contain records." + ); } - + let mut all_entries: Vec = Vec::new(); for line in stdout_str.lines() { let line = line.trim(); if line.is_empty() { continue; } match serde_json::from_str::(line) { - Ok(entry) => { - total_entries += 1; - if filter_by_tags(&entry, &tags_to_use) { - entries.push(entry); - filtered_entries += 1; - } - } + Ok(entry) => all_entries.push(entry), Err(e) => { - // Skip invalid JSON lines (might be compilation output that leaked through) if !line.contains("Compiling") && !line.contains("Finished") { eprintln!("Warning: Failed to parse JSON line: {}", e); } } } } - - if total_entries == 0 { - eprintln!("Warning: No log entries found. Make sure repositories are cloned and contain log files."); - } else if filtered_entries == 0 && !tags_to_use.is_empty() { - eprintln!("Warning: Found {} entries but none matched the specified tags. Entries may not have tags yet - consider running 'govbot tag' first, or build without --tags to include all entries.", total_entries); - } - - // Deduplicate and sort - entries = deduplicate_entries(entries); - entries = sort_by_timestamp(entries); - - // Apply limit (default is 100) - let original_count = entries.len(); - if let Some(lim) = limit_value { - entries.truncate(lim); - if original_count > lim { - eprintln!("Limited feed to {} entries (RSS standard). 
Use --limit none to include all {} entries.", lim, original_count); - } - } - - // Create output directory - fs::create_dir_all(&output_dir_path)?; - - // Generate RSS - eprintln!("Generating RSS feed with {} entries...", entries.len()); - let rss_xml = rss::json_to_rss( - entries.clone(), - &feed_title, - &feed_description, - feed_link, - base_url.as_deref(), - "en-us", - ); - - // Write RSS feed - let rss_output_path = output_dir_path.join(&output_filename); - fs::write(&rss_output_path, rss_xml)?; - eprintln!("✓ Generated RSS feed: {}", rss_output_path.display()); - - // Generate HTML - eprintln!("Generating HTML index with {} entries...", entries.len()); - // Only pass title if it was explicitly set in config (not auto-generated) - let html_title = build_config - .and_then(|p| p.get("title")) - .and_then(|t| t.as_str()) - .filter(|s| !s.trim().is_empty()); - let html_content = rss::json_to_html( - entries, - html_title, - feed_link, - base_url.as_deref(), - ); - - // Write HTML index - let html_output_path = output_dir_path.join("index.html"); - fs::write(&html_output_path, html_content)?; - eprintln!("✓ Generated HTML index: {}", html_output_path.display()); - eprintln!(" Tags included: {}", tags_to_use.join(", ")); - + + // CLI `--limit` overrides every publisher's configured limit. + let cli_limit: Option> = limit.map(|s| { + if s.eq_ignore_ascii_case("none") { + None + } else { + s.parse().ok() + } + }); + + // Resolve the companion html-publisher landing URL once: the bluesky + // publisher uses it as the default for `{link}` so a post links to the + // human-readable HTML index, not the raw metadata.json path under its + // own `base_url`. None when the manifest has no html publisher. + let html_entry_url: Option = manifest + .publish + .values() + .find(|p| p.kind == govbot::PublisherKind::Html) + .and_then(|p| p.base_url.clone()) + .filter(|u| !u.trim().is_empty()); + + // Run each named publisher against its filtered/sorted/limited stream. 
+ for name in &names_to_run { + let publisher = manifest.publish.get(name).expect("checked above"); + let select = publisher.select.clone().unwrap_or_default(); + + eprintln!( + "\n=== Publisher '{}' ({:?}) — selecting tags: {} ===", + name, + publisher.kind, + if select.is_empty() { + "".to_string() + } else { + select.join(", ") + } + ); + + // Filter to the publisher's selected tags, dedup, sort. + // + // The bluesky publisher does its own **score-aware** per-bill dedup + // (highest-scoring log per (jurisdiction, bill_id) becomes the + // representative — see `bluesky::run_bluesky`); the global + // first-wins dedup would force a "newest" winner that drops a + // bill whose newest log carries no qualifying tag even when an + // older log scored above the threshold. Skip the global dedup for + // bluesky so the publisher sees every log for every bill. + let mut entries: Vec = all_entries + .iter() + .filter(|e| filter_by_tags(e, &select)) + .cloned() + .collect(); + if publisher.kind != govbot::PublisherKind::Bluesky { + entries = deduplicate_entries(entries); + } + entries = sort_by_timestamp(entries); + + // Apply the limit: CLI override, else the publisher's, else 100. + // + // **The limit is a per-bill cap**, not a per-action-log cap — for + // non-bluesky publishers that's already true (entries are + // pre-dedup'd by bill above). For bluesky we skipped the + // pre-dedup, so the entry stream still carries N action-log + // records per bill; truncating it here would arbitrarily clip + // bills before bluesky's own dedup runs. Skip the limit for + // bluesky and let the publisher cap **after** its score-aware + // per-bill dedup (a future enhancement; the runtime cost of + // posting every qualifying bill is already small relative to + // the activist's daily-digest expectations). 
+ let limit_value: Option = match cli_limit { + Some(v) => v, + None => publisher.resolved_limit(Some(100)), + }; + let original_count = entries.len(); + if let Some(lim) = limit_value { + if publisher.kind != govbot::PublisherKind::Bluesky { + entries.truncate(lim); + if original_count > lim { + eprintln!( + "Limited '{}' to {} entries. Use --limit none for all {}.", + name, lim, original_count + ); + } + } + } + + let job = govbot::publish::PublishJob { + name, + publisher, + entries, + output_dir_override: output_dir.clone(), + output_file_override: output_file.clone(), + project_dir: current_dir.clone(), + dry_run, + html_entry_url: html_entry_url.clone(), + }; + govbot::publish::run_publisher(&job)?; + } + Ok(()) } async fn run_update_command() -> anyhow::Result<()> { let install_script_url = "https://raw.githubusercontent.com/chihacknight/govbot/main/actions/govbot/scripts/install-nightly.sh"; - + eprintln!("🔄 Updating govbot to latest nightly version..."); - eprintln!("Downloading and running install script from: {}", install_script_url); - + eprintln!( + "Downloading and running install script from: {}", + install_script_url + ); + // Execute the install script by piping curl directly to sh // This avoids issues with shebang lines being interpreted as commands let mut cmd = ProcessCommand::new("sh"); cmd.arg("-c"); cmd.arg(&format!("curl -fsSL {} | sh", install_script_url)); - + // Inherit stdin/stdout/stderr so the install script can interact with the user cmd.stdin(std::process::Stdio::inherit()); cmd.stdout(std::process::Stdio::inherit()); cmd.stderr(std::process::Stdio::inherit()); - + let status = cmd.status()?; - + if status.success() { eprintln!("\n✅ Update completed successfully!"); eprintln!("You may need to restart your terminal or run 'source ~/.zshrc' (or your shell profile) to use the updated version."); } else { - return Err(anyhow::anyhow!("Update failed with exit code: {}", status.code().unwrap_or(-1))); + return Err(anyhow::anyhow!( + 
"Update failed with exit code: {}", + status.code().unwrap_or(-1) + )); } - + Ok(()) } -#[tokio::main] -async fn main() -> anyhow::Result<()> { - let args = Args::parse(); +/// Locate the project's `govbot.yml`, erroring if there is none. +fn require_manifest_path() -> anyhow::Result { + let path = project_dir()?.join("govbot.yml"); + if !path.exists() { + anyhow::bail!( + "No govbot.yml in {}. Run `govbot init` to scaffold one.", + project_dir()?.display() + ); + } + Ok(path) +} - match args.command { - Some(cmd @ Command::Clone { .. }) => { - run_clone_command(cmd).await +/// `govbot add` — append validated dataset ids to `govbot.yml`'s `datasets:`. +fn run_add_command(cmd: Command) -> anyhow::Result<()> { + let Command::Add { datasets } = cmd else { + unreachable!() + }; + let manifest_path = require_manifest_path()?; + let registry = load_registry()?; + + // Validate every id against the registry before touching the file. + let mut to_add = Vec::new(); + for id in &datasets { + let id = id.trim(); + if id.is_empty() { + continue; } - Some(cmd @ Command::Delete { .. }) => { - run_delete_command(cmd).await + if id.eq_ignore_ascii_case("all") { + to_add.push("all".to_string()); + continue; } - Some(cmd @ Command::Logs { .. }) => { - run_logs_command(cmd).await + let resolved = registry.resolve(id).map_err(|e| anyhow::anyhow!("{}", e))?; + // Add the identifier the user typed (keeps `wy` short and familiar); + // resolution proved it valid. + let _ = resolved; + to_add.push(id.to_string()); + } + + // Parse the manifest, mutate `datasets`, write it back. 
+ let contents = std::fs::read_to_string(&manifest_path)?; + let mut doc: serde_yaml::Value = serde_yaml::from_str(&contents) + .map_err(|e| anyhow::anyhow!("Failed to parse govbot.yml: {}", e))?; + + let datasets_node = doc + .get_mut("datasets") + .and_then(|v| v.as_sequence_mut()) + .ok_or_else(|| anyhow::anyhow!("govbot.yml has no `datasets:` list"))?; + + let mut added = Vec::new(); + for id in to_add { + let already = datasets_node + .iter() + .any(|v| v.as_str() == Some(id.as_str())); + if already { + eprintln!(" · {} already in datasets", id); + } else { + datasets_node.push(serde_yaml::Value::String(id.clone())); + added.push(id); } - Some(cmd @ Command::Load { .. }) => { - run_load_command(cmd).await + } + + if added.is_empty() { + eprintln!("Nothing to add."); + return Ok(()); + } + + let yaml = serde_yaml::to_string(&doc) + .map_err(|e| anyhow::anyhow!("Failed to serialize govbot.yml: {}", e))?; + std::fs::write(&manifest_path, yaml)?; + for id in &added { + eprintln!(" + added {}", id); + } + eprintln!( + "✅ Updated {}. Run `govbot pull` to fetch.", + manifest_path.display() + ); + Ok(()) +} + +/// `govbot remove` — drop dataset ids from `govbot.yml`'s `datasets:`. 
+fn run_remove_command(cmd: Command) -> anyhow::Result<()> { + let Command::Remove { datasets } = cmd else { + unreachable!() + }; + let manifest_path = require_manifest_path()?; + + let contents = std::fs::read_to_string(&manifest_path)?; + let mut doc: serde_yaml::Value = serde_yaml::from_str(&contents) + .map_err(|e| anyhow::anyhow!("Failed to parse govbot.yml: {}", e))?; + + let datasets_node = doc + .get_mut("datasets") + .and_then(|v| v.as_sequence_mut()) + .ok_or_else(|| anyhow::anyhow!("govbot.yml has no `datasets:` list"))?; + + let targets: Vec = datasets + .iter() + .map(|s| s.trim().to_string()) + .filter(|s| !s.is_empty()) + .collect(); + + let before = datasets_node.len(); + let mut removed = Vec::new(); + datasets_node.retain(|v| { + if let Some(s) = v.as_str() { + if targets.iter().any(|t| t == s) { + removed.push(s.to_string()); + return false; + } } - Some(Command::Update) => { - run_update_command().await + true + }); + + if datasets_node.len() == before { + eprintln!("No matching datasets found in govbot.yml."); + return Ok(()); + } + + let yaml = serde_yaml::to_string(&doc) + .map_err(|e| anyhow::anyhow!("Failed to serialize govbot.yml: {}", e))?; + std::fs::write(&manifest_path, yaml)?; + for id in &removed { + eprintln!(" - removed {}", id); + } + eprintln!("✅ Updated {}.", manifest_path.display()); + Ok(()) +} + +/// `govbot ls` — list the project's manifest datasets and locally-cached ones. +fn run_ls_command(cmd: Command) -> anyhow::Result<()> { + let Command::Ls { govbot_dir, output } = cmd else { + unreachable!() + }; + let registry = load_registry()?; + let repos_dir = get_govbot_dir(govbot_dir)?; + let local: Vec = git::get_local_datasets(&repos_dir).unwrap_or_default(); + + // The manifest's declared datasets, if a govbot.yml exists. 
+ let manifest_path = project_dir()?.join("govbot.yml"); + let manifest_datasets: Vec = if manifest_path.exists() { + match govbot::Manifest::load(&manifest_path) { + Ok(m) => m.datasets, + Err(_) => Vec::new(), } - Some(cmd @ Command::Tag { .. }) => { - run_tag_command(cmd).await + } else { + Vec::new() + }; + + if output == "json" { + let out = serde_json::json!({ + "manifest": manifest_datasets, + "cached": local, + "registry_total": registry.datasets.len(), + }); + println!("{}", serde_json::to_string_pretty(&out)?); + return Ok(()); + } + + if !manifest_datasets.is_empty() { + println!("Manifest datasets (govbot.yml):"); + for d in &manifest_datasets { + let cached = local.iter().any(|c| c == d) || d == "all"; + let mark = if cached { "✓" } else { "·" }; + println!(" {} {}", mark, d); } - Some(cmd @ Command::Build { .. }) => { - run_build_command(cmd).await + println!(); + } + + println!("Cached locally ({}):", local.len()); + if local.is_empty() { + println!(" (none — run `govbot pull` to fetch)"); + } else { + for d in &local { + println!(" {}", d); } - None => { - let cwd = std::env::current_dir()?; - let config_path = cwd.join("govbot.yml"); - if !config_path.exists() { - // Generate govbot.yml: interactive wizard or defaults - if std::io::IsTerminal::is_terminal(&std::io::stdin()) { - govbot::wizard::run_wizard()?; - } else { + } + + // With no project manifest, list the registry — the help promises this so + // `govbot ls` in a bare directory is genuinely useful for discovery. + if manifest_datasets.is_empty() { + println!(); + println!( + "Registry ({} dataset(s) — run `govbot search` to filter):", + registry.datasets.len() + ); + for d in registry.all() { + let name = d.entry.name.as_deref().unwrap_or(""); + println!(" {:<28} {}", d.id, name); + } + } + Ok(()) +} + +/// `govbot search` — query the dataset registry. 
+fn run_search_command(cmd: Command) -> anyhow::Result<()> { + let Command::Search { query, output } = cmd else { + unreachable!() + }; + let registry = load_registry()?; + let query_str = query.join(" "); + let hits = registry.search(&query_str); + + if output == "json" { + let rows: Vec<_> = hits + .iter() + .map(|d| { + serde_json::json!({ + "id": d.id, + "name": d.entry.name, + "git_url": d.entry.git_url, + "schema": d.entry.schema, + "path_pattern": d.entry.path_pattern, + }) + }) + .collect(); + println!("{}", serde_json::to_string_pretty(&rows)?); + return Ok(()); + } + + if hits.is_empty() { + eprintln!("No datasets match '{}'.", query_str); + return Ok(()); + } + println!("{} dataset(s):", hits.len()); + for d in &hits { + let name = d.entry.name.as_deref().unwrap_or(""); + println!(" {:<28} {}", d.id, name); + } + Ok(()) +} + +// --------------------------------------------------------------------------- +// `govbot doctor` — corpus-level data-integrity smoke test. +// +// Why this exists: two real-data bugs (7592418, 5ab6d3c) shipped because the +// only test harness was the mock dataset, which happened to fit a single +// happy-path layout. Both bugs would have been caught by a five-line check — +// "every emitted doc id is unique" and "every id resolves to a present +// metadata.json" — over a real pulled cache. `doctor` is that check, wired +// to a CLI verb activists can run after `pull all` to confirm the project +// is coherent before flipping `bluesky` off `--dry-run`. +// +// This is a smoke test, not a unit test. It assumes pulled data and skips +// cleanly when the cache is empty. +// --------------------------------------------------------------------------- + +/// Per-record sample, captured during the source walk so the metadata.json +/// and text checks can run after the stream is fully drained. +#[derive(Debug, Clone)] +struct DoctorSample { + id: String, + text_len: usize, +} + +/// Per-dataset rollup used to build the doctor report. 
+#[derive(Debug, Default)] +struct DatasetSummary { + record_count: usize, + distinct_ids: std::collections::HashSet<String>, + samples: Vec<DoctorSample>, +} + +/// Outcome of one assertion bucket for one dataset — a short label, a +/// pass flag, an optional warn flag, and the detail lines (capped so a +/// broken dataset doesn't drown the report). A `warned` check still +/// counts as passing for the overall exit code — it surfaces noteworthy +/// state (e.g. zero records under `--filter default`) without failing CI. +#[derive(Debug, Clone)] +struct DoctorCheck { + name: &'static str, + passed: bool, + warned: bool, + detail: Vec<String>, +} + +#[derive(Debug)] +struct DatasetReport { + dataset: String, + record_count: usize, + distinct_ids: usize, + sampled: usize, + checks: Vec<DoctorCheck>, +} + +impl DatasetReport { + fn passed(&self) -> bool { + self.checks.iter().all(|c| c.passed) + } + fn warned(&self) -> bool { + self.checks.iter().any(|c| c.warned) + } +} + +/// Cap how many failing ids we print per check — keeps the report scannable +/// when an entire dataset is broken. +const MAX_FAIL_DETAIL: usize = 5; + +/// Default minimum acceptable `text` length per record. Anything shorter +/// is almost certainly a join failure (metadata.json missing or empty), not +/// a legitimate short bill. +const MIN_TEXT_LEN: usize = 50; + +/// Per-dataset distinct-id / record-count ratio floor. Bug 7592418 +/// collapsed 4916 records onto 97 ids (ratio 0.02). The floor is set +/// at 0.03 — high enough to flag a 100x collision, low enough to +/// accept a dataset where a handful of active bills emit many +/// substantive log records each (e.g. a state with sustained voting +/// activity on the same few bills). Drop it further if a clean cache +/// shows legitimate sub-0.03 ratios. +const MIN_DISTINCT_RATIO: f64 = 0.03; + +/// Map a parsed `parse_doc_route` dataset prefix (e.g. `nj-legislation`) +/// to the bare short_name (`nj`) that `git::get_local_datasets` returns.
+/// This is the only place where doc-id prefixes and on-disk dataset +/// short names meet; getting it wrong silently breaks the per-dataset +/// bucketing. +fn dataset_short_name(prefix: &str, suffix: &str) -> String { + if let Some(s) = prefix.strip_suffix(suffix) { + s.to_string() + } else if let Some(s) = prefix.strip_suffix("-data-pipeline") { + s.to_string() + } else { + prefix.to_string() + } +} + +fn run_doctor_command(cmd: Command) -> anyhow::Result<()> { + let Command::Doctor { + govbot_dir, + sample, + limit, + output, + } = cmd + else { + unreachable!() + }; + + let repos_dir = get_govbot_dir(govbot_dir.clone())?; + + // Skip-cleanly contract: an empty or absent cache is not a failure. + // `doctor` is a smoke test, not a unit test — it has nothing to check + // until data is pulled. Exit 0 with a clear note. + if !repos_dir.exists() { + let note = format!( + "doctor: no cache at {} — run `govbot pull all` first. Skipping.", + repos_dir.display() + ); + if output == "json" { + println!( + "{}", + serde_json::json!({ "status": "skipped", "reason": note }) + ); + } else { + eprintln!("{}", note); + } + return Ok(()); + } + + let datasets = match git::get_local_datasets(&repos_dir) { + Ok(d) => d, + Err(e) => anyhow::bail!("doctor: failed to enumerate cached datasets: {}", e), + }; + + // Stale or broken entries in `repos/` — names that look like dataset + // links (matching the configured suffix) but don't resolve to a real + // directory. A broken symlink is the canonical case; the entry sits + // in `repos/` but `get_local_datasets` filtered it out because + // `is_dir()` follows the link and returns false. Surface these so + // they're not invisible — they break `govbot source` for that state + // without any other signal. + let broken_dataset_entries = enumerate_broken_dataset_entries(&repos_dir); + + if datasets.is_empty() { + let note = format!( + "doctor: {} is empty — run `govbot pull all` first. 
Skipping.", + repos_dir.display() + ); + if output == "json" { + println!( + "{}", + serde_json::json!({ "status": "skipped", "reason": note }) + ); + } else { + eprintln!("{}", note); + } + return Ok(()); + } + + // Resolve the parent govbot-dir for the subprocess `--govbot-dir` arg. + // `get_govbot_dir` appends `/repos`; we pass the parent so the child + // appends its own `/repos` and lands on the same path. + let govbot_dir_arg = repos_dir + .parent() + .map(|p| p.to_string_lossy().to_string()) + .unwrap_or_else(|| ".govbot".to_string()); + + let started = std::time::Instant::now(); + + // Stream every record once, in --select docs --limit none mode. We use a + // subprocess so we exercise the same code path activists hit; doctor is + // a "what does `govbot source` actually emit?" check, not a re-derivation. + let stream = collect_doc_stream(&govbot_dir_arg, &limit) + .map_err(|e| anyhow::anyhow!("doctor: source stream failed: {}", e))?; + + // Bucket records by dataset short_name. The doc id carries the full + // `-legislation` (or legacy `-data-pipeline`) repo dir + // prefix; `get_local_datasets` returns the bare short_name, so we + // normalise both to the short form before keying. + let mut per_dataset: HashMap = HashMap::new(); + let mut unrouted: Vec = Vec::new(); + + let suffix = std::env::var("GOVBOT_REPO_SUFFIX").unwrap_or_else(|_| "-legislation".to_string()); + + for rec in &stream { + let id = rec.id.clone(); + + // Route to a dataset via the `/country:...` prefix in the + // id. A record we can't route is recorded for the global report; it + // can't contribute to per-dataset coverage. 
+ let dataset_short = parse_doc_route(&id) + .and_then(|r| r.dataset) + .map(|d| dataset_short_name(&d, &suffix)); + match dataset_short { + Some(d) => { + let entry = per_dataset.entry(d).or_default(); + entry.record_count += 1; + entry.distinct_ids.insert(id.clone()); + if entry.samples.len() < sample { + entry.samples.push(DoctorSample { + id, + text_len: rec.text_len, + }); + } + } + None => { + if unrouted.len() < MAX_FAIL_DETAIL { + unrouted.push(id); + } + } + } + } + + // Build per-dataset reports. The four per-dataset checks are: coverage + // (≥1 record), id-distinctness (the bug 7592418 signature — many + // records collapsing onto one id), sampled-metadata-json-resolves, + // and sampled-text-length. + let mut dataset_reports: Vec = Vec::with_capacity(datasets.len()); + for dataset in &datasets { + let prefix = git::repo_dir_name(dataset); + let dataset_repo_dir = repos_dir.join(&prefix); + let summary = per_dataset.remove(dataset.as_str()).unwrap_or_default(); + + let mut checks = Vec::new(); + + // Coverage — a zero-record dataset is reported as a warning, + // not a failure: `--filter default` legitimately drops every + // record in a dataset whose only recent logs are routine + // (introductions, committee referrals). That state is normal + // for a freshly-cloned session early in its calendar. Doctor + // surfaces it so the activist can notice — pulled but silent — + // without failing the overall smoke test. + let coverage_warned = summary.record_count == 0; + let coverage_detail = if coverage_warned { + vec![format!( + "{} is linked but produced 0 records (likely an empty session or `--filter default` dropping every log — not necessarily broken)", + prefix + )] + } else { + Vec::new() + }; + checks.push(DoctorCheck { + name: "coverage", + passed: true, + warned: coverage_warned, + detail: coverage_detail, + }); + + // ID distinctness — bug 7592418 collapsed 4916 records onto 97 + // ids (ratio 0.02). After the fix it's ~0.81. 
A per-log emission + // pattern legitimately produces some duplicate ids (the same + // bill emitting multiple substantive log events), so we don't + // demand uniqueness — but we do demand the ratio stay well + // above the bug-case floor. Below MIN_DISTINCT_RATIO is the + // smoking gun. + let distinct = summary.distinct_ids.len(); + let total = summary.record_count; + let ratio = if total == 0 { + 1.0 + } else { + distinct as f64 / total as f64 + }; + let distinctness_passed = total == 0 || ratio >= MIN_DISTINCT_RATIO; + let distinctness_detail = if distinctness_passed { + Vec::new() + } else { + vec![format!( + "{}/{} distinct ids (ratio {:.2}) — below the {:.2} floor; ids are likely collapsing across distinct bills (the bug-7592418 signature)", + distinct, total, ratio, MIN_DISTINCT_RATIO + )] + }; + checks.push(DoctorCheck { + name: "id_distinctness", + passed: distinctness_passed, + warned: false, + detail: distinctness_detail, + }); + + // Metadata.json resolves + let mut metadata_failures: Vec = Vec::new(); + for s in &summary.samples { + if let Err(reason) = check_metadata_json(&s.id, &dataset_repo_dir) { + if metadata_failures.len() < MAX_FAIL_DETAIL { + metadata_failures.push(format!("{} :: {}", s.id, reason)); + } + } + } + checks.push(DoctorCheck { + name: "metadata_sampleable", + passed: metadata_failures.is_empty(), + warned: false, + detail: metadata_failures, + }); + + // Text length + let mut text_failures: Vec = Vec::new(); + for s in &summary.samples { + if s.text_len < MIN_TEXT_LEN && text_failures.len() < MAX_FAIL_DETAIL { + text_failures.push(format!( + "{} :: text length {} < {}", + s.id, s.text_len, MIN_TEXT_LEN + )); + } + } + checks.push(DoctorCheck { + name: "text_non_empty", + passed: text_failures.is_empty(), + warned: false, + detail: text_failures, + }); + + dataset_reports.push(DatasetReport { + dataset: dataset.clone(), + record_count: summary.record_count, + distinct_ids: summary.distinct_ids.len(), + sampled: 
summary.samples.len(), + checks, + }); + } + + // Build the global report. Global "duplicate ids" check is gone — + // per-log emission legitimately produces some duplicates. The id + // collapse bug (7592418) is caught per-dataset by id_distinctness. + let elapsed = started.elapsed(); + let total_records: usize = dataset_reports.iter().map(|r| r.record_count).sum(); + let total_distinct: usize = dataset_reports.iter().map(|r| r.distinct_ids).sum(); + let all_passed = unrouted.is_empty() + && broken_dataset_entries.is_empty() + && dataset_reports.iter().all(|r| r.passed()); + + if output == "json" { + emit_doctor_json( + &dataset_reports, + total_records, + total_distinct, + &unrouted, + &broken_dataset_entries, + elapsed, + all_passed, + ); + } else { + emit_doctor_text( + &dataset_reports, + total_records, + total_distinct, + &unrouted, + &broken_dataset_entries, + elapsed, + all_passed, + ); + } + + if !all_passed { + // Non-zero exit so a CI step `govbot doctor` fails the pipeline. + std::process::exit(1); + } + Ok(()) +} + +/// Names sitting in `/` that look like dataset entries (matching +/// the configured suffix) but don't resolve to a real directory — e.g. a +/// dangling symlink left over from a hand-edited cache, or a broken +/// pull. `get_local_datasets` silently filters these out; doctor surfaces +/// them as a global failure so they don't go unnoticed. 
+fn enumerate_broken_dataset_entries(repos_dir: &Path) -> Vec<String> { + let suffix = std::env::var("GOVBOT_REPO_SUFFIX").unwrap_or_else(|_| "-legislation".to_string()); + let mut broken = Vec::new(); + let read = match std::fs::read_dir(repos_dir) { + Ok(r) => r, + Err(_) => return broken, + }; + for entry in read.flatten() { + let path = entry.path(); + let Some(name) = path.file_name().and_then(|n| n.to_str()) else { + continue; + }; + let looks_like_dataset = name.ends_with(&suffix) || name.ends_with("-data-pipeline"); + if !looks_like_dataset { + continue; + } + // `is_dir()` follows symlinks, so a dangling symlink reads false. + if !path.is_dir() { + broken.push(name.to_string()); + } + } + broken.sort(); + broken +} + +/// Minimal `{id,text,kind}` record drained from `govbot source --select docs`. +#[derive(Debug)] +struct DocRecord { + id: String, + text_len: usize, +} + +/// Invoke `govbot source --select docs --limit <limit>` against the given +/// cache and return one `DocRecord` per emitted JSON line. We materialise +/// fully rather than streaming — the assertion set needs the whole corpus +/// before per-dataset ratios mean anything, and at the smoke-test limit +/// (default 100/repo, ~5000 records total) memory is a non-issue.
+fn collect_doc_stream(govbot_dir: &str, limit: &str) -> std::io::Result<Vec<DocRecord>> { + let exe = std::env::current_exe().unwrap_or_else(|_| PathBuf::from("govbot")); + let mut source_cmd = ProcessCommand::new(&exe); + source_cmd + .arg("source") + .arg("--select") + .arg("docs") + .arg("--limit") + .arg(limit) + .arg("--filter") + .arg("default") + .arg("--join") + .arg("bill") + .arg("--sort") + .arg("DESC") + .arg("--govbot-dir") + .arg(govbot_dir); + + let output = source_cmd.output()?; + if !output.status.success() { + let stderr_str = String::from_utf8_lossy(&output.stderr); + return Err(std::io::Error::other(format!( + "source exited with status {:?}: {}", + output.status.code(), + stderr_str + ))); + } + + let mut records = Vec::new(); + for line in output.stdout.split(|b| *b == b'\n') { + if line.is_empty() { + continue; + } + let v: serde_json::Value = match serde_json::from_slice(line) { + Ok(v) => v, + Err(_) => continue, // Best-effort — source itself logs the parse failure. + }; + let id = v + .get("id") + .and_then(|x| x.as_str()) + .unwrap_or("") + .to_string(); + let text_len = v + .get("text") + .and_then(|x| x.as_str()) + .map(|s| s.len()) + .unwrap_or(0); + // A record without an id lands in the global `unrouted` + // bucket (`parse_doc_route` returns None for the empty string), + // surfacing as a global routability failure. + records.push(DocRecord { id, text_len }); + } + Ok(records) +} + +/// Translate a doc id back to its on-disk `metadata.json` and confirm it +/// (a) exists, (b) parses as JSON, (c) has at least a `title` or `identifier` +/// field. The first leg is what would have caught 5ab6d3c — a dir-name vs +/// `log.bill_id` whitespace mismatch produces an id whose metadata.json +/// path simply doesn't exist on disk.
+fn check_metadata_json(doc_id: &str, dataset_repo_dir: &Path) -> Result<(), String> { + let route = parse_doc_route(doc_id).ok_or_else(|| { + "id does not match expected `country:.../bills/` shape".to_string() + })?; + // Path: /country:/state:/sessions//bills//metadata.json + let metadata_path = dataset_repo_dir + .join(format!("country:{}", route.country)) + .join(format!("state:{}", route.state)) + .join("sessions") + .join(&route.session) + .join("bills") + .join(&route.bill_id) + .join("metadata.json"); + + if !metadata_path.exists() { + return Err(format!( + "metadata.json not found at {}", + metadata_path.display() + )); + } + let contents = fs::read_to_string(&metadata_path) + .map_err(|e| format!("cannot read {}: {}", metadata_path.display(), e))?; + let value: serde_json::Value = serde_json::from_str(&contents) + .map_err(|e| format!("invalid JSON in {}: {}", metadata_path.display(), e))?; + let has_title = value + .get("title") + .and_then(|v| v.as_str()) + .map(|s| !s.is_empty()) + .unwrap_or(false); + let has_identifier = value + .get("identifier") + .and_then(|v| v.as_str()) + .map(|s| !s.is_empty()) + .unwrap_or(false); + if !has_title && !has_identifier { + return Err(format!( + "metadata.json at {} has neither `title` nor `identifier`", + metadata_path.display() + )); + } + Ok(()) +} + +/// Human-readable doctor report. Per-dataset one-liners followed by the +/// global summary; failures get an indented detail block. 
+fn emit_doctor_text( + dataset_reports: &[DatasetReport], + total_records: usize, + total_distinct: usize, + unrouted: &[String], + broken_entries: &[String], + elapsed: std::time::Duration, + all_passed: bool, +) { + println!( + "govbot doctor — {} dataset(s), {} record(s), {} distinct id(s), {:.2}s", + dataset_reports.len(), + total_records, + total_distinct, + elapsed.as_secs_f64() + ); + println!(); + + for r in dataset_reports { + let status = if !r.passed() { + "FAIL" + } else if r.warned() { + "WARN" + } else { + "PASS" + }; + println!( + " [{}] {:<22} records={:<5} distinct={:<5} sampled={}", + status, r.dataset, r.record_count, r.distinct_ids, r.sampled + ); + for c in &r.checks { + if !c.passed { + println!(" - {}: FAIL", c.name); + for d in &c.detail { + println!(" • {}", d); + } + } else if c.warned { + println!(" - {}: WARN", c.name); + for d in &c.detail { + println!(" • {}", d); + } + } + } + } + + println!(); + if !broken_entries.is_empty() { + println!( + " [FAIL] global.dataset_links {} broken or non-dir entry/entries in repos/:", + broken_entries.len() + ); + for name in broken_entries.iter().take(MAX_FAIL_DETAIL) { + println!( + " • {} (likely a dangling symlink or non-directory)", + name + ); + } + if broken_entries.len() > MAX_FAIL_DETAIL { + println!( + " • ...and {} more", + broken_entries.len() - MAX_FAIL_DETAIL + ); + } + } else { + println!(" [PASS] global.dataset_links"); + } + + if !unrouted.is_empty() { + println!( + " [FAIL] global.routable_ids {} id(s) without a `/country:...` prefix:", + unrouted.len() + ); + for id in unrouted.iter().take(MAX_FAIL_DETAIL) { + println!(" • {}", id); + } + } else { + println!(" [PASS] global.routable_ids"); + } + + println!(); + if all_passed { + println!("doctor: PASS"); + } else { + println!("doctor: FAIL"); + } +} + +/// Machine-readable doctor report. Stable enough to pipe into CI. 
+fn emit_doctor_json( + dataset_reports: &[DatasetReport], + total_records: usize, + total_distinct: usize, + unrouted: &[String], + broken_entries: &[String], + elapsed: std::time::Duration, + all_passed: bool, +) { + let datasets: Vec<serde_json::Value> = dataset_reports + .iter() + .map(|r| { + let checks: Vec<serde_json::Value> = r + .checks + .iter() + .map(|c| { + serde_json::json!({ + "name": c.name, + "passed": c.passed, + "warned": c.warned, + "detail": c.detail, + }) + }) + .collect(); + serde_json::json!({ + "dataset": r.dataset, + "passed": r.passed(), + "record_count": r.record_count, + "distinct_ids": r.distinct_ids, + "sampled": r.sampled, + "checks": checks, + }) + }) + .collect(); + let report = serde_json::json!({ + "status": if all_passed { "pass" } else { "fail" }, + "elapsed_secs": elapsed.as_secs_f64(), + "total_records": total_records, + "total_distinct_ids": total_distinct, + "unrouted_ids": unrouted, + "broken_dataset_entries": broken_entries, + "datasets": datasets, + }); + println!("{}", serde_json::to_string_pretty(&report).unwrap()); +} + +#[tokio::main] +async fn main() -> anyhow::Result<()> { + let args = Args::parse(); + + match args.command { + Some(cmd @ Command::Pull { .. }) => run_pull_command(cmd).await, + Some(cmd @ Command::Delete { .. }) => run_delete_command(cmd).await, + Some(cmd @ Command::Source { .. }) => run_source_command(cmd).await, + Some(cmd @ Command::Load { .. }) => run_load_command(cmd).await, + Some(Command::Update) => run_update_command().await, + Some(cmd @ Command::Apply { .. }) => run_apply_command(cmd).await, + Some(cmd @ Command::Publish { .. }) => run_publish_command(cmd).await, + Some(Command::Run { + govbot_dir, + dry_run, + }) => { + let cwd = std::env::current_dir()?; + let config_path = cwd.join("govbot.yml"); + if !config_path.exists() { + anyhow::bail!( + "No govbot.yml in {}. 
Run `govbot init` to scaffold one, then `govbot run`.", + cwd.display() + ); + } + govbot::pipeline::run_pipeline(&config_path, govbot_dir.as_deref(), dry_run) + } + Some(Command::Init { + from_frankie_config, + into, + }) => { + // Migration path: --from-frankie-config bypasses the wizard and + // scaffolds from a Frankie-style topics//config.yml. The + // init_from_frankie module handles its own pre-flight checks + // (refusing to overwrite an existing govbot.yml in ). + if let Some(frankie_path) = from_frankie_config { + let into_path = into.map(std::path::PathBuf::from); + return govbot::init_from_frankie::run( + std::path::Path::new(&frankie_path), + into_path.as_deref(), + ); + } + + // Wizard / defaults path. `--into` is honored here too so a + // non-Frankie scaffold can target a directory other than cwd. + let into_provided = into.is_some(); + let target = match into { + Some(p) => { + let path = std::path::PathBuf::from(&p); + std::fs::create_dir_all(&path)?; + path + } + None => std::env::current_dir()?, + }; + let config_path = target.join("govbot.yml"); + if config_path.exists() { + eprintln!("govbot.yml already exists in {}.", target.display()); + return Ok(()); + } + // The interactive wizard always writes to cwd; only run it when + // the user did not pass --into (otherwise honor --into via the + // non-interactive default writer). + if !into_provided && std::io::IsTerminal::is_terminal(&std::io::stdin()) { + govbot::wizard::run_wizard() + } else { + govbot::wizard::write_default_files(&target) + } + } + Some(cmd @ Command::Add { .. }) => run_add_command(cmd), + Some(cmd @ Command::Remove { .. }) => run_remove_command(cmd), + Some(cmd @ Command::Ls { .. }) => run_ls_command(cmd), + Some(cmd @ Command::Search { .. }) => run_search_command(cmd), + Some(cmd @ Command::Doctor { .. 
}) => run_doctor_command(cmd), + Some(Command::Logs { + repos, + limit, + join, + select, + filter, + sort, + govbot_dir, + }) => { + // Deprecation warning MUST go to stderr — stdout is the + // bills.jsonl payload `govbot logs > bills.jsonl` consumers + // (the CHN-Bluesky-Govbot-Main framework) pipe to disk. + // Printing to stdout would corrupt the JSON-Lines stream. + eprintln!( + "warning: `govbot logs` is deprecated; use `govbot source` instead. The old form will be removed in a future major version." + ); + // Delegate to the canonical source handler with identical args. + run_source_command(Command::Source { + repos, + limit, + join, + select, + filter, + sort, + govbot_dir, + }) + .await + } + None => { + let cwd = std::env::current_dir()?; + let config_path = cwd.join("govbot.yml"); + if !config_path.exists() { + // Generate govbot.yml: interactive wizard or defaults + if std::io::IsTerminal::is_terminal(&std::io::stdin()) { + govbot::wizard::run_wizard()?; + } else { govbot::wizard::write_default_files(&cwd)?; } // Exit after generating config; user runs `govbot` again // to start the pipeline (matches the wizard's own message). return Ok(()); } - govbot::pipeline::run_pipeline(&config_path) + govbot::pipeline::run_pipeline(&config_path, None, false) } } } + +#[cfg(test)] +mod tests { + use super::*; + + /// A typical `govbot source --select docs` id — the leading dataset + /// `short_name` is what `govbot apply` uses to route the `.tag.json` under + /// `/tags//...` by default. 
+ #[test] + fn parse_doc_route_extracts_dataset_prefix() { + let route = + parse_doc_route("wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001") + .expect("dataset path should parse"); + assert_eq!(route.dataset.as_deref(), Some("wy-legislation")); + assert_eq!(route.country, "us"); + assert_eq!(route.state, "wy"); + assert_eq!(route.session, "2025"); + assert_eq!(route.bill_id, "HB0001"); + } + + /// A doc id with no dataset prefix — `apply` falls back to the project + /// dir rather than dropping the record on the floor. + #[test] + fn parse_doc_route_handles_missing_dataset_prefix() { + let route = parse_doc_route("country:us/state:wy/sessions/2025/bills/HB0001") + .expect("dataset path without prefix should still parse"); + assert!(route.dataset.is_none()); + assert_eq!(route.bill_id, "HB0001"); + } + + /// A non-bill doc id (e.g. a future stream-kind) — `None` so `apply` + /// skips the record with a warning. + #[test] + fn parse_doc_route_rejects_non_bill_ids() { + assert!(parse_doc_route("just-some-other-id").is_none()); + assert!(parse_doc_route("wy-legislation/country:us").is_none()); + } + + /// The mock layout — logs already live under `bills//logs/` — so + /// stripping `/logs/...` from `sources.log` directly yields the bill + /// path. The `id` must be that full dataset-rooted bill path, ready + /// for `parse_doc_route` to find a `bills` segment and route the + /// `.tag.json` back to the correct bill. 
+ #[test] + fn ocd_entry_to_doc_per_bill_log_layout_keeps_bill_suffix() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "ANY" } }, + "bill": { "title": "Mock bill", "identifier": "HB0001" }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/20250101T000000Z_foo.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert_eq!( + doc.get("id").and_then(|v| v.as_str()), + Some("wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001") + ); + // And it must round-trip through `parse_doc_route` — the contract + // `govbot apply` depends on. + assert_eq!( + parse_doc_route(doc.get("id").unwrap().as_str().unwrap()) + .expect("route") + .bill_id, + "HB0001" + ); + } + + /// REGRESSION (real-data bug): `govbot pull all` clones OCD-files-shaped + /// datasets whose on-disk logs live at `sessions//logs/.json` + /// as *symlinks* into per-bill `bills//logs/.json`. The walker + /// reports the symlink path, so `sources.log` does NOT contain `/bills/ + /// /` and the old `log_path.split("/logs/").next()` builder dropped + /// the bill_id, collapsing every bill in a session onto one id. Over the + /// 55-state corpus that compressed 4916 distinct bill records into 97 + /// session ids; `apply` then overwrote every tag file's `bills` map + /// repeatedly and the bluesky ledger silently marked one bill per + /// session as "done." The id must carry `/bills/` so each bill + /// hashes to a distinct slot. 
+ #[test] + fn ocd_entry_to_doc_session_level_log_layout_appends_bill_id() { + let entry = serde_json::json!({ + "log": { "bill_id": "SB50", "action": { "description": "PASSED" } }, + "bill": { "title": "Mock bill", "identifier": "SB50" }, + "sources": { + // Realistic shape from `govbot pull ak`: session-level log + // path, no `/bills//` segment because the walker + // followed the symlink-source view, not the canonical + // target. + "log": "ak-legislation/country:us/state:ak/sessions/34/logs/20250317T000000Z.vote_event.pass.upper_SB50.json", + "bill": "../../../../.govbot/cache/ak-abc123/country:us/state:ak/sessions/34/bills/SB50/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert_eq!( + doc.get("id").and_then(|v| v.as_str()), + Some("ak-legislation/country:us/state:ak/sessions/34/bills/SB50"), + "id must include /bills/ for session-level log layouts" + ); + // The whole point: this id must round-trip through `parse_doc_route` + // so `govbot apply` keys per-bill, not per-session. + let route = parse_doc_route(doc.get("id").unwrap().as_str().unwrap()) + .expect("session-level layout must still produce a routable doc id"); + assert_eq!(route.bill_id, "SB50"); + assert_eq!(route.session, "34"); + } + + /// Two distinct bills from the same session must yield two distinct ids — + /// the precondition the apply layer and the bluesky publisher's ledger + /// rely on. This is the unit-level expression of the corpus check + /// `len(ids) == len(set(ids))`. 
+ #[test] + fn ocd_entry_to_doc_distinct_bills_same_session_get_distinct_ids() { + let make = |bill_id: &str, log_file: &str| { + serde_json::json!({ + "log": { "bill_id": bill_id, "action": { "description": "PASSED" } }, + "bill": { "title": "Mock", "identifier": bill_id }, + "sources": { + "log": format!( + "ak-legislation/country:us/state:ak/sessions/34/logs/{}", + log_file + ), + "bill": format!( + "../../../../.govbot/cache/ak-x/country:us/state:ak/sessions/34/bills/{}/metadata.json", + bill_id + ) + } + }) + }; + let entries = vec![ + make("SB50", "20250317T000000Z.vote_event.pass.upper_SB50.json"), + make("HR2", "20250121T000000Z.vote_event.pass.lower_HR2.json"), + make("HJR20", "20250514T000000Z_h_fn1_zeroleg_HJR20.json"), + make("HB55", "20250306T000000Z_h_heard_held_HB55.json"), + ]; + let ids: Vec<String> = entries + .iter() + .map(|e| { + ocd_entry_to_doc(e) + .get("id") + .and_then(|v| v.as_str()) + .unwrap() + .to_string() + }) + .collect(); + let unique: std::collections::HashSet<&String> = ids.iter().collect(); + assert_eq!( + ids.len(), + unique.len(), + "4 bills under one session must produce 4 distinct ids; got: {:?}", + ids + ); + } + + /// REGRESSION (real-data bug, 55-state corpus): MI/WV/ND/PA legislature + /// logs ship a `bill_id` field with a *display* space — e.g. + /// `"HB 5077"`, `"SB 0001"` — even though the corresponding on-disk + /// directory is `bills/HB5077/`, `bills/SB0001/` (no space). The + /// pre-fix `ocd_entry_to_doc` for the Layout-2 (session-level symlink) + /// case appended `log.bill_id` verbatim, producing ids like + /// `mi-legislation/.../bills/SB 0001`. Downstream consumers doing a + /// sibling `metadata.json` lookup via path joining + /// (`os.path.join(REPOS, doc, "metadata.json")`) then 404'd because no + /// such directory exists on disk. The architect saw "(no metadata.json)" + /// for ~30% of bills. 
+ /// + /// The fix sources the `/bills/` segment from the resolved + /// `sources.bill` path (the parent dir of `metadata.json`, which the + /// `bill` join produced from the canonicalized log path) — that is the + /// authoritative on-disk dir name. The id must NOT contain whitespace + /// in the bill segment, and it must point to a directory that exists. + #[test] + fn ocd_entry_to_doc_uses_canonical_bill_dir_when_log_bill_id_has_whitespace() { + let entry = serde_json::json!({ + "log": { + // Display form with a space — this is what MI/WV/ND/PA emit. + "bill_id": "SB 0001", + "action": { "description": "PASSED" } + }, + "bill": { "title": "Mock", "identifier": "SB 0001" }, + "sources": { + // Session-level symlink layout (Layout 2). `sources.log` + // stops at the session because the walker reported the + // symlink, not the canonical target. + "log": "mi-legislation/country:us/state:mi/sessions/2025-2026/logs/20250108T000000Z_referred_to_committee_of_the_whole_SB0001.json", + // `sources.bill` points at the *resolved* on-disk + // metadata.json — the parent dir is the canonical bill dir + // name (no whitespace). + "bill": "../../../../.govbot/cache/mi-ad5ea7bbd548/country:us/state:mi/sessions/2025-2026/bills/SB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + let id = doc + .get("id") + .and_then(|v| v.as_str()) + .expect("doc id must be a string"); + // The id must end at the on-disk dir, not the display bill_id. + assert_eq!( + id, "mi-legislation/country:us/state:mi/sessions/2025-2026/bills/SB0001", + "id must use the canonical on-disk bill dir name (no whitespace)" + ); + // No whitespace anywhere in the id — that's what makes + // `os.path.join(REPOS, doc, \"metadata.json\")` resolve to a real + // file on a real filesystem. 
+ assert!( + !id.contains(' '), + "id must not carry display-form whitespace; got: {}", + id + ); + } + + /// Same data shape, all four affected states (MI/WV/ND/PA) — pins that + /// the fix isn't accidentally specific to one state's path shape. + #[test] + fn ocd_entry_to_doc_uses_canonical_bill_dir_for_all_affected_states() { + // (display_bill_id, on_disk_dir, dataset, state, session, log_filename) + let cases = [ + ( + "SB 0001", + "SB0001", + "mi-legislation", + "mi", + "2025-2026", + "20250108T000000Z_referred_to_committee_of_the_whole_SB0001.json", + ), + ( + "SB 458", + "SB458", + "wv-legislation", + "wv", + "2025", + "20250307T000000Z_read_2nd_time_SB458.json", + ), + ( + "SB 2262", + "SB2262", + "nd-legislation", + "nd", + "69", + "20250501T000000Z_signed_by_governor_0429_SB2262.json", + ), + ( + "HB 1271", + "HB1271", + "pa-legislation", + "pa", + "2025-2026", + "20250421T040000Z_referred_to_education_HB1271.json", + ), + ]; + for (display_id, on_disk_dir, dataset, state, session, log_file) in cases { + let entry = serde_json::json!({ + "log": { "bill_id": display_id, "action": { "description": "PASSED" } }, + "bill": { "title": "Mock", "identifier": display_id }, + "sources": { + "log": format!( + "{}/country:us/state:{}/sessions/{}/logs/{}", + dataset, state, session, log_file + ), + "bill": format!( + "../../../../.govbot/cache/{}-deadbeef/country:us/state:{}/sessions/{}/bills/{}/metadata.json", + state, state, session, on_disk_dir + ) + } + }); + let doc = ocd_entry_to_doc(&entry); + let id = doc + .get("id") + .and_then(|v| v.as_str()) + .unwrap_or_default() + .to_string(); + assert_eq!( + id, + format!( + "{}/country:us/state:{}/sessions/{}/bills/{}", + dataset, state, session, on_disk_dir + ), + "{}: id must use the on-disk dir `{}`, not the log's display id `{}`", + state, + on_disk_dir, + display_id + ); + assert!( + !id.contains(' '), + "{}: id contains whitespace; got: {}", + state, + id + ); + // Round-trip: the route's bill_id must be the on-disk 
dir + // name, because that's what every downstream path lookup + // (`os.path.join(REPOS, doc, ...)`) is going to hit. + let route = + parse_doc_route(&id).expect("routable doc id even for spaced bill_id inputs"); + assert_eq!( + route.bill_id, on_disk_dir, + "{}: parsed bill_id must be the on-disk dir", + state + ); + } + } + + /// REGRESSION (real-data follow-on of the whitespace fix): MI/ND/PA + /// also publish a Layout-1 view for some bills — `sources.log` is + /// `.../sessions//bills//logs/.json` because + /// the walker happened to land on the per-bill log directly. In that + /// case the stripped path already ends in `/bills/` + /// (e.g. `bills/HR0163`). But `log.bill_id` is `"HR 0163"` (display + /// form). The pre-fix Layout-1 detector compared the stripped path's + /// suffix to `log.bill_id` verbatim, which DID NOT match (no space + /// vs space), so the code fell through to the Layout-2 branch and + /// appended `/bills/HR0163` *again*, producing + /// `mi-legislation/.../bills/HR0163/bills/HR0163`. Sample over the + /// 55-state corpus: ~50% of mi/nd/pa records exhibited the + /// doubled-bills id. The Layout-1 detector must therefore consider + /// both the canonical dir name (from `sources.bill`) and + /// `log.bill_id`; a match on either means the path already names + /// the bill. + #[test] + fn ocd_entry_to_doc_layout1_with_spaced_log_bill_id_does_not_double_bills_segment() { + let entry = serde_json::json!({ + "log": { + // Display form with a space — what MI/ND/PA emit. + "bill_id": "HR 0163", + "action": { "description": "ANY" } + }, + "bill": { "title": "Mock", "identifier": "HR 0163" }, + "sources": { + // Layout 1 — the walker landed on the per-bill log dir. + // The stripped path will end in `/bills/HR0163` (no space). 
+ "log": "mi-legislation/country:us/state:mi/sessions/2025-2026/bills/HR0163/logs/20250101T000000Z_foo.json", + "bill": "../../../../.govbot/cache/mi-x/country:us/state:mi/sessions/2025-2026/bills/HR0163/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + let id = doc.get("id").and_then(|v| v.as_str()).unwrap_or_default(); + assert_eq!( + id, "mi-legislation/country:us/state:mi/sessions/2025-2026/bills/HR0163", + "Layout 1 with spaced log.bill_id must not double-append the /bills/ segment" + ); + // The cardinal symptom of the bug: a doubled `bills//bills/` tail. + assert!( + !id.contains("/bills/HR0163/bills/"), + "id must not double the bills segment; got: {}", + id + ); + assert!( + !id.contains(' '), + "id must not contain whitespace; got: {}", + id + ); + } + + /// `bill_dir_from_metadata_path` is the helper the fix relies on. Unit- + /// test the shape boundary so future refactors don't silently break it. + #[test] + fn bill_dir_from_metadata_path_extracts_dir_segment() { + assert_eq!( + bill_dir_from_metadata_path( + "../../../../.govbot/cache/mi-x/country:us/state:mi/sessions/2025-2026/bills/HB5109/metadata.json" + ), + Some("HB5109") + ); + assert_eq!( + bill_dir_from_metadata_path( + "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + ), + Some("HB0001") + ); + // Not a bill metadata path — refuse to guess. + assert_eq!( + bill_dir_from_metadata_path("country:us/state:wy/sessions/2025/metadata.json"), + None + ); + assert_eq!(bill_dir_from_metadata_path("metadata.json"), None); + assert_eq!(bill_dir_from_metadata_path(""), None); + } + + /// When the consumer ran `govbot source --select docs` *without* + /// `--join bill`, `sources.bill` is absent and we have no canonical + /// dir to lean on. Fall back to `log.bill_id` so the id is still + /// routable — even if it carries display-form whitespace. 
Document + /// that this is the advisory path; the production `source --select + /// docs` invocation always joins `bill`, so this branch only fires + /// for ad-hoc invocations. + #[test] + fn ocd_entry_to_doc_falls_back_to_log_bill_id_when_bill_join_absent() { + let entry = serde_json::json!({ + "log": { "bill_id": "SB 0001", "action": { "description": "PASSED" } }, + "sources": { + "log": "mi-legislation/country:us/state:mi/sessions/2025-2026/logs/20250108T000000Z_x.json" + // No `sources.bill` — `--join bill` was not requested. + } + }); + let doc = ocd_entry_to_doc(&entry); + assert_eq!( + doc.get("id").and_then(|v| v.as_str()), + Some("mi-legislation/country:us/state:mi/sessions/2025-2026/bills/SB 0001"), + "without sources.bill we fall back to log.bill_id (advisory; may carry whitespace)" + ); + } + + /// A4: OCD `subject:` arrays are gold-standard human classifications that + /// fastclass's future `concept_match` matcher reads. When the bill carries + /// a populated `subject:` list, the docs projection must surface it under + /// `subjects` so it travels with the rest of the bill text. 
+ #[test] + fn ocd_entry_to_doc_surfaces_subjects_when_present() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "bill": { + "title": "An act about clean energy", + "identifier": "HB0001", + "subject": ["ENERGY", "ENVIRONMENT", "TAXATION"] + }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + let subjects = doc + .get("subjects") + .and_then(|v| v.as_array()) + .expect("subjects must be present and an array when bill carries subject:"); + let actual: Vec<&str> = subjects.iter().filter_map(|v| v.as_str()).collect(); + assert_eq!( + actual, + vec!["ENERGY", "ENVIRONMENT", "TAXATION"], + "subjects must mirror the OCD subject: array verbatim and in order" + ); + // The rest of the contract — id/text/kind — must be unaffected by + // the additive field. + assert_eq!(doc.get("kind").and_then(|v| v.as_str()), Some("docs")); + assert!( + doc.get("text") + .and_then(|v| v.as_str()) + .unwrap_or("") + .contains("clean energy"), + "existing text projection must still include the bill title" + ); + } + + /// A4: When the bill has no `subject:` key at all, the docs record must + /// have **no `subjects` key** (not `"subjects": []`). Many states omit + /// the OCD subject array entirely; conflating that with "explicitly + /// empty" would force the consumer to guess. + #[test] + fn ocd_entry_to_doc_omits_subjects_when_bill_has_no_subject_key() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "bill": { + "title": "An untagged bill", + "identifier": "HB0001" + // No subject: key at all. 
+ }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert!( + doc.get("subjects").is_none(), + "subjects must be omitted entirely when bill has no subject: field; got: {:?}", + doc.get("subjects") + ); + } + + /// A4: An explicitly empty `subject: []` is treated the same as missing — + /// no `subjects` key in the output. WY's `HB0001` mock has `subject: []` + /// for example; we don't want every WY record to ship `"subjects": []` + /// just because the OCD scraper materialized an empty list. + #[test] + fn ocd_entry_to_doc_omits_subjects_when_subject_array_is_empty() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "bill": { + "title": "An empty-subjects bill", + "identifier": "HB0001", + "subject": [] + }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert!( + doc.get("subjects").is_none(), + "subjects must be omitted for explicit empty arrays — empty conflates with \ + absent and breaks the 'present means signal' contract; got: {:?}", + doc.get("subjects") + ); + } + + /// A4: A `subject:` array with only blank strings is treated as empty — + /// the trim-then-filter pass means whitespace-only entries don't make it + /// into the projection, and a list of all-blank entries omits the field. 
+ #[test] + fn ocd_entry_to_doc_omits_subjects_when_subject_array_is_all_blank() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "bill": { + "title": "A whitespace-only-subjects bill", + "identifier": "HB0001", + "subject": ["", " "] + }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json", + "bill": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json" + } + }); + let doc = ocd_entry_to_doc(&entry); + assert!( + doc.get("subjects").is_none(), + "subjects must be omitted when every subject element is blank/whitespace" + ); + } + + /// A4: When the entry is a bare `log` record (no `--join bill`), + /// `subjects` cannot be derived — there's no bill metadata to read from. + /// The field must be omitted. This is the same fallback path as the id + /// resolution above; without the bill join we have no `subject:` source. + #[test] + fn ocd_entry_to_doc_omits_subjects_when_bill_join_absent() { + let entry = serde_json::json!({ + "log": { "bill_id": "HB0001", "action": { "description": "PASSED" } }, + "sources": { + "log": "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/x.json" + // No `sources.bill`, no `bill` join. + } + }); + let doc = ocd_entry_to_doc(&entry); + assert!( + doc.get("subjects").is_none(), + "subjects must be omitted when the bill metadata isn't joined into the entry" + ); + } + + /// `.govbot/` is the cache; tag files belong outside it in the project- + /// rooted `tags/` output dir. The resolver's primary candidate must + /// therefore be `/tags//country:.../state:.../sessions/ + /// /`, with the in-cache `/tags/` location kept only as a + /// read-only fallback for working trees mid-migration. This regression + /// pins both — Bug 1's revisit must not silently restore the cache as + /// the primary location. 
+ #[test] + fn resolve_tags_dir_candidates_prefer_project_tags_then_cache_fallback() { + let tmp = tempfile::tempdir().expect("tempdir"); + let project = tmp.path().join("project"); + let session = project + .join(".govbot") + .join("repos") + .join("wy-legislation") + .join("country:us") + .join("state:wy") + .join("sessions") + .join("2025"); + let log_path = session + .join("bills") + .join("HB0001") + .join("logs") + .join("2025-01-15T12:00:00Z.json"); + fs::create_dir_all(log_path.parent().unwrap()).unwrap(); + fs::write(&log_path, "{}").unwrap(); + + let candidates = resolve_tags_dir_candidates(&log_path, &project); + // Primary is the project-rooted output dir. + assert_eq!( + candidates.first().expect("primary candidate"), + &project + .join("tags") + .join("wy-legislation") + .join("country:us") + .join("state:wy") + .join("sessions") + .join("2025"), + ); + // Fallback A is the Bug-6 in-cache layout — read-only for migration. + assert!(candidates.iter().any(|c| c == &session.join("tags"))); + // And critically: the cache is NOT the primary location. + assert_ne!(candidates.first().unwrap(), &session.join("tags")); + } + + /// A log file outside any dataset layout (no `bills/` ancestor) yields + /// no candidates, letting the caller fall back to the legacy cwd-rooted + /// lookup. + #[test] + fn resolve_tags_dir_candidates_empty_outside_dataset_layout() { + let tmp = tempfile::tempdir().expect("tempdir"); + let stray = tmp.path().join("loose").join("file.json"); + fs::create_dir_all(stray.parent().unwrap()).unwrap(); + fs::write(&stray, "{}").unwrap(); + assert!(resolve_tags_dir_candidates(&stray, tmp.path()).is_empty()); + } + + /// Dataset isolation — the whole reason the `` segment lives at + /// the top of `tags/`. 
Two datasets sharing a `country:us/state:xx` + /// jurisdiction must write the same-named tag file to *different* files + /// on disk, keyed by short_name, so a project tracking multiple + /// jurisdictions never has one dataset's classification clobber + /// another's. + #[test] + fn tag_paths_are_dataset_isolated() { + // Synthesise the per-dataset destinations the way `run_apply_command` + // does, against two short_names that share a country/state/session. + let project = std::path::PathBuf::from("/tmp/project"); + let tags_root = project.join("tags"); + + let short_a = "wy-legislation"; + let short_b = "wy-counties"; + let country = "country:us"; + let state = "state:wy"; + let session = "2025"; + let tag = "clean_energy"; + + let path_a = tags_root + .join(short_a) + .join(country) + .join(state) + .join("sessions") + .join(session) + .join(format!("{}.tag.json", tag)); + let path_b = tags_root + .join(short_b) + .join(country) + .join(state) + .join("sessions") + .join(session) + .join(format!("{}.tag.json", tag)); + + assert_ne!(path_a, path_b, "dataset prefix must split the tag file"); + // Both must share the `tags/` prefix — the project's + // classification-output dir — never `.govbot/`. + assert!(path_a.starts_with(&tags_root)); + assert!(path_b.starts_with(&tags_root)); + let govbot_cache = project.join(".govbot"); + assert!(!path_a.starts_with(&govbot_cache)); + assert!(!path_b.starts_with(&govbot_cache)); + } + + /// End-to-end of the helper: a tag file in the dataset-rooted `tags/` + /// dir produces a `{tag_name: score}` map for the bill it lists, and an + /// empty map for a bill it does not list. 
+ #[test] + fn match_tags_in_dir_returns_scores_for_matching_bill() { + let tmp = tempfile::tempdir().expect("tempdir"); + let tags_dir = tmp.path().join("tags"); + fs::create_dir_all(&tags_dir).unwrap(); + let tag_file = serde_json::json!({ + "metadata": { + "last_run": "2025-01-15T12:00:00Z", + "model": "fastclass-test", + "tag_config_hash": "abc123" + }, + "tag_config": { + "name": "clean_energy" + }, + "bills": { + "HB0001": { + "text_hash": "deadbeef", + "score": { + "final_score": 0.92, + "base_embedding": null, + "example_similarity": null, + "keyword_match": [], + "negative_penalty": 0.0 + } + } + } + }); + fs::write(tags_dir.join("clean_energy.tag.json"), tag_file.to_string()).unwrap(); + + let matched = match_tags_in_dir(&tags_dir, "HB0001"); + assert_eq!(matched.len(), 1); + assert!(matched.contains_key("clean_energy")); + + let missing = match_tags_in_dir(&tags_dir, "HB9999"); + assert!(missing.is_empty()); + + // Missing dir is not an error — callers chain dataset-rooted then + // cwd-rooted lookups, and a non-existent dir is the common case. + let absent = match_tags_in_dir(&tmp.path().join("no-such-dir"), "HB0001"); + assert!(absent.is_empty()); + } + + // ----------------------------------------------------------------- + // `govbot doctor` — the corpus-level smoke test. The full end-to-end + // path is exercised against real pulled data (see commit message for + // run details); these unit tests pin the failure-detection legs that + // would have caught bugs 7592418 and 5ab6d3c. + // ----------------------------------------------------------------- + + /// The metadata.json check is the leg that would have flagged 5ab6d3c + /// — a doc id whose dir-name was wrong (display form, with whitespace) + /// resolves to a non-existent metadata.json path. 
+ #[test] + fn doctor_check_metadata_json_flags_missing_file() { + let tmp = tempfile::TempDir::new().unwrap(); + let dataset_dir = tmp.path().join("mi-legislation"); + let bill_dir = dataset_dir + .join("country:us") + .join("state:mi") + .join("sessions") + .join("2025-2026_103rd_Legislature") + .join("bills") + .join("HB4027"); + fs::create_dir_all(&bill_dir).unwrap(); + // Write a well-formed metadata.json — happy path. + fs::write( + bill_dir.join("metadata.json"), + serde_json::to_string(&serde_json::json!({ + "title": "An Act…", + "identifier": "HB 4027", + })) + .unwrap(), + ) + .unwrap(); + + // A clean id resolves. + let good_id = + "mi-legislation/country:us/state:mi/sessions/2025-2026_103rd_Legislature/bills/HB4027"; + assert!(check_metadata_json(good_id, &dataset_dir).is_ok()); + + // The exact pre-5ab6d3c failure: log.bill_id `"HB 4027"` (with + // whitespace) bleeds into the doc id, and the on-disk dir is + // `HB4027` — so the metadata.json path doesn't exist. + let broken_id = + "mi-legislation/country:us/state:mi/sessions/2025-2026_103rd_Legislature/bills/HB 4027"; + let err = check_metadata_json(broken_id, &dataset_dir).unwrap_err(); + assert!( + err.contains("not found"), + "expected 'not found' in error, got: {}", + err + ); + } + + /// metadata.json present but lacking both `title` and `identifier` — + /// counts as a fail. This catches stub/empty-bill clones where the + /// scraper landed but populated nothing usable. + #[test] + fn doctor_check_metadata_json_requires_title_or_identifier() { + let tmp = tempfile::TempDir::new().unwrap(); + let dataset_dir = tmp.path().join("wy-legislation"); + let bill_dir = dataset_dir + .join("country:us") + .join("state:wy") + .join("sessions") + .join("2025") + .join("bills") + .join("HB0001"); + fs::create_dir_all(&bill_dir).unwrap(); + fs::write( + bill_dir.join("metadata.json"), + // Neither title nor identifier — both empty / absent. 
+ serde_json::to_string(&serde_json::json!({"description": "..."})).unwrap(), + ) + .unwrap(); + + let id = "wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001"; + let err = check_metadata_json(id, &dataset_dir).unwrap_err(); + assert!(err.contains("neither `title` nor `identifier`")); + } + + /// `dataset_short_name` is the only place where the dataset prefix + /// in a doc id (`-legislation`) and the short_name returned by + /// `get_local_datasets` (``) meet. Getting this wrong silently + /// breaks per-dataset bucketing — every dataset shows zero coverage + /// even though records were emitted. Pin both common suffixes. + #[test] + fn doctor_dataset_short_name_strips_known_suffixes() { + assert_eq!(dataset_short_name("nj-legislation", "-legislation"), "nj"); + assert_eq!(dataset_short_name("usa-legislation", "-legislation"), "usa"); + // Legacy `-data-pipeline` layout — strip it too. + assert_eq!(dataset_short_name("wy-data-pipeline", "-legislation"), "wy"); + // Custom suffix from GOVBOT_REPO_SUFFIX is honoured. + assert_eq!(dataset_short_name("nj-pkg", "-pkg"), "nj"); + // Bare short_name (no suffix at all) passes through. + assert_eq!(dataset_short_name("wy", "-legislation"), "wy"); + } + + /// metadata.json is unreadable JSON — that's still a fail (we can't + /// trust a record whose bill metadata won't even parse). 
+ #[test] + fn doctor_check_metadata_json_flags_unparseable() { + let tmp = tempfile::TempDir::new().unwrap(); + let dataset_dir = tmp.path().join("ca-legislation"); + let bill_dir = dataset_dir + .join("country:us") + .join("state:ca") + .join("sessions") + .join("2025-2026") + .join("bills") + .join("AB100"); + fs::create_dir_all(&bill_dir).unwrap(); + fs::write(bill_dir.join("metadata.json"), b"{ this is not json").unwrap(); + let id = "ca-legislation/country:us/state:ca/sessions/2025-2026/bills/AB100"; + let err = check_metadata_json(id, &dataset_dir).unwrap_err(); + assert!(err.contains("invalid JSON")); + } +} diff --git a/actions/govbot/src/pipeline.rs b/actions/govbot/src/pipeline.rs index f744cca1..f61ccdd7 100644 --- a/actions/govbot/src/pipeline.rs +++ b/actions/govbot/src/pipeline.rs @@ -1,137 +1,496 @@ +use crate::config::{Command_, Manifest, Transform}; +use crate::git::repo_dir_name; use anyhow::{Context, Result}; -use std::path::Path; +use std::collections::HashMap; +use std::path::{Path, PathBuf}; use std::process::{Command, Stdio}; -/// Run the full govbot pipeline: clone/update → tag → build. +/// Run the full govbot pipeline against the project's `govbot.yml`. /// -/// Smart update behavior: -/// - If `.govbot/repos/` exists with repos: just update existing repos (git pull) -/// - If `.govbot/repos/` does not exist: clone repos based on govbot.yml config -pub fn run_pipeline(config_path: &Path) -> Result<()> { - let govbot_bin = std::env::current_exe() - .context("Failed to determine govbot binary path")?; - - let cwd = config_path - .parent() - .unwrap_or_else(|| Path::new(".")); - - let repos_dir = cwd.join(".govbot").join("repos"); +/// Stages: +/// 1. **pull/update** — clone or git-pull the manifest's `datasets`. +/// 2. 
**classify+apply** — the transform DAG: stream `source --select docs` +/// into each declared transform (an external process speaking the govbot +/// stream protocol) and pipe the final transform's output into +/// `govbot apply`. +/// 3. **publish** — run `govbot publish` to emit the manifest's publishers. +/// +/// `dry_run` is passed through to step 3 so publishers render but do not +/// emit; the `bluesky` publisher in particular honours it by touching no +/// network and no ledger. +/// +/// Smart update behavior: if `/repos/` already has datasets, just +/// `git pull`; otherwise clone the manifest's `datasets`. +pub fn run_pipeline(config_path: &Path, govbot_dir: Option<&str>, dry_run: bool) -> Result<()> { + let govbot_bin = std::env::current_exe().context("Failed to determine govbot binary path")?; + + let cwd = config_path.parent().unwrap_or_else(|| Path::new(".")); + + let manifest = Manifest::load(config_path)?; + + // The transforms govbot runs in step 2. If the manifest declares no + // pipeline, fall back to the classic single classify-transform DAG (a + // `fastclass classify` stage with the classifier bundle at `.`). + let transforms = resolve_pipeline_transforms(&manifest)?; + + // Fast-fail if a transform's binary cannot be resolved. + let resolved: Vec<(String, ResolvedTransform)> = transforms + .iter() + .map(|(name, t)| resolve_transform(t).map(|r| (name.clone(), r))) + .collect::<Result<Vec<_>>>()?; + + // Resolve the repos directory the way subcommands do. + let repos_dir = match govbot_dir { + Some(d) => Path::new(d).join("repos"), + None => cwd.join(".govbot").join("repos"), + }; let has_repos = repos_dir.exists() && std::fs::read_dir(&repos_dir) .map(|mut d| d.next().is_some()) .unwrap_or(false); - // Step 1: Clone or update repos - eprintln!(); - eprintln!("=== Step 1/3: {} repositories ===", if has_repos { "Updating" } else { "Cloning" }); - eprintln!(); + // Classify each manifest dataset: is the project-local seed already + // populated for it? 
If every declared dataset has a non-empty + // project-local directory under `repos/`, the pull substep is a no-op — + // skip it (and the shared `~/.govbot/cache/` write the registry-driven + // pull would attempt) so a sandbox / read-only HOME does not error out a + // run that has all the data it needs sitting right there. + let locally_seeded: Vec<&String> = manifest + .datasets + .iter() + .filter(|name| name.as_str() != "all" && is_local_seed(&repos_dir, name)) + .collect(); + let all_locally_seeded = + !manifest.datasets.is_empty() && locally_seeded.len() == manifest.datasets.len(); - let clone_status = if has_repos { - // Update existing repos only - Command::new(&govbot_bin) - .arg("clone") - .current_dir(cwd) - .stdin(Stdio::inherit()) - .stdout(Stdio::inherit()) - .stderr(Stdio::inherit()) - .status() - } else { - // First run: clone based on config - let config = crate::publish::load_config(config_path)?; - let repos = crate::publish::get_repos_from_config(&config); - - let mut cmd = Command::new(&govbot_bin); - cmd.arg("clone"); - for repo in &repos { - cmd.arg(repo); + // Step 1: pull or update datasets. + eprintln!(); + eprintln!( + "=== Step 1/3: {} datasets ===", + if all_locally_seeded { + "Using local seed for" + } else if has_repos { + "Updating" + } else { + "Pulling" } - cmd.current_dir(cwd) - .stdin(Stdio::inherit()) - .stdout(Stdio::inherit()) - .stderr(Stdio::inherit()) - .status() - }; + ); + eprintln!(); - match clone_status { - Ok(status) if !status.success() => { - eprintln!("⚠️ Clone/update had errors (continuing anyway)"); + if all_locally_seeded { + for name in &locally_seeded { + let seed = repos_dir.join(seed_dir_name(name)); + eprintln!("📂 using local seed: {}", seed.display()); } - Err(e) => { - eprintln!("⚠️ Failed to run clone: {} (continuing anyway)", e); + // Skip the cache-touching pull subprocess entirely. 
+ } else { + let pull_status = { + let mut cmd = Command::new(&govbot_bin); + cmd.arg("pull"); + if !has_repos { + // Initial pull: clone the manifest's datasets. + for dataset in &manifest.datasets { + cmd.arg(dataset); + } + } + if let Some(d) = govbot_dir { + cmd.arg("--govbot-dir").arg(d); + } + cmd.current_dir(cwd) + .stdin(Stdio::inherit()) + .stdout(Stdio::inherit()) + .stderr(Stdio::inherit()) + .status() + }; + match pull_status { + Ok(status) if !status.success() => { + eprintln!("⚠️ Pull/update had errors (continuing anyway)"); + } + Err(e) => { + eprintln!("⚠️ Failed to run pull: {} (continuing anyway)", e); + } + _ => {} } - _ => {} } - // Step 2: Tag bills (govbot logs | govbot tag) + // Step 2: run the transform DAG (source | transform... | apply). + // + // The source stage must honour the manifest's `datasets:` scope — without + // it, `govbot source` walks every linked dataset under `.govbot/repos/` + // (which can include datasets pulled by an earlier `datasets: [all]` run + // and never cleaned up), classifying tens of thousands of records the + // current manifest did not declare. + let source_repos = source_repos_from_manifest(&manifest.datasets); eprintln!(); - eprintln!("=== Step 2/3: Tagging bills ==="); + eprintln!("=== Step 2/3: Running transforms (source | ... | apply) ==="); eprintln!(); - - let tag_result = run_logs_pipe_tag(&govbot_bin, cwd); - match tag_result { + match run_transform_dag(&govbot_bin, &resolved, cwd, govbot_dir, &source_repos) { Ok(false) => { - eprintln!("⚠️ Tagging had errors (continuing anyway)"); + eprintln!("⚠️ Transform stage had errors (continuing anyway)"); } Err(e) => { - eprintln!("⚠️ Failed to run tagging: {} (continuing anyway)", e); + eprintln!("⚠️ Failed to run transforms: {} (continuing anyway)", e); } _ => {} } - // Step 3: Build RSS feeds + // Step 3: publish. 
eprintln!(); - eprintln!("=== Step 3/3: Building RSS feeds ==="); + eprintln!("=== Step 3/3: Publishing ==="); eprintln!(); - - let build_status = Command::new(&govbot_bin) - .arg("build") + let mut publish_cmd = Command::new(&govbot_bin); + publish_cmd.arg("publish"); + if let Some(d) = govbot_dir { + publish_cmd.arg("--govbot-dir").arg(d); + } + if dry_run { + publish_cmd.arg("--dry-run"); + } + let publish_status = publish_cmd .current_dir(cwd) .stdin(Stdio::inherit()) .stdout(Stdio::inherit()) .stderr(Stdio::inherit()) .status() - .context("Failed to run govbot build")?; - - if !build_status.success() { - anyhow::bail!("Build step failed with exit code: {}", build_status.code().unwrap_or(-1)); + .context("Failed to run govbot publish")?; + if !publish_status.success() { + anyhow::bail!( + "Publish step failed with exit code: {}", + publish_status.code().unwrap_or(-1) + ); } eprintln!(); eprintln!("Pipeline complete!"); - Ok(()) } -/// Run `govbot logs | govbot tag` by piping stdout of logs into stdin of tag. -/// Returns Ok(true) if both succeeded, Ok(false) if either failed. -fn run_logs_pipe_tag(govbot_bin: &Path, cwd: &Path) -> Result { - let mut logs_child = Command::new(govbot_bin) - .arg("logs") +/// Resolve which transforms `govbot run` executes. +/// +/// If the manifest declares pipelines, the first pipeline's stages that name a +/// `transforms:` entry are run, in order. (Publisher stages are handled by the +/// separate `publish` step.) If no pipeline / no transforms are declared, fall +/// back to a single `fastclass classify` transform with the classifier bundle +/// at `.` (the project directory). +fn resolve_pipeline_transforms(manifest: &Manifest) -> Result> { + // Prefer an explicit pipeline; pick the first one deterministically. 
+ if let Some((_, stages)) = manifest.pipelines.iter().next() { + let mut out = Vec::new(); + for stage in stages { + if let Some(t) = manifest.transforms.get(stage) { + out.push((stage.clone(), t.clone())); + } + // A stage naming a publisher is handled by the publish step; + // a stage naming neither is a manifest error surfaced elsewhere. + } + if !out.is_empty() { + return Ok(out); + } + } + + // No pipeline transforms: run every declared transform, in name order. + if !manifest.transforms.is_empty() { + return Ok(manifest + .transforms + .iter() + .map(|(k, v)| (k.clone(), v.clone())) + .collect()); + } + + // Nothing declared — the classic single classify transform. The classifier + // bundle defaults to `.` (the project directory holding the bundle). + Ok(vec![( + "classify".to_string(), + Transform { + command: Command_::Argv(vec![ + "fastclass".to_string(), + "classify".to_string(), + "-".to_string(), + ]), + reads: "docs".to_string(), + writes: "classification".to_string(), + classifier: Some(".".to_string()), + }, + )]) +} + +/// A transform whose binary has been resolved to an absolute path, with its +/// full argv assembled (including any `classifier=` argument). +struct ResolvedTransform { + /// The resolved executable path. + bin: PathBuf, + /// Arguments passed after the executable. + args: Vec<String>, +} + +/// Resolve a transform's command to an executable + argv. +/// +/// The first argv element is the binary, resolved against `$PATH` and the +/// standard install locations (`~/.cargo/bin`, `~/.govbot/bin`). For a +/// classify-style transform the `classifier=` field is appended as an +/// explicit argument — NOT hard-coded to the cwd.
+fn resolve_transform(t: &Transform) -> Result<ResolvedTransform> { + let argv = t.command.argv(); + let (bin_name, rest) = argv.split_first().context("transform `command` is empty")?; + + let bin = resolve_transform_binary(bin_name).ok_or_else(|| { + anyhow::anyhow!( + "transform binary `{}` not found on PATH, ~/.cargo/bin, or ~/.govbot/bin.\n\ + For the bundled classify transform, install fastclass:\n\ + cd && just install (or: cargo install --path .)", + bin_name + ) + })?; + + let mut args: Vec<String> = rest.to_vec(); + // Append the explicit classifier bundle path for classify-style transforms. + if let Some(classifier) = &t.classifier { + args.push(format!("classifier={}", classifier)); + } + + Ok(ResolvedTransform { bin, args }) +} + +/// The user's home directory, from `$HOME` (Unix) or `%USERPROFILE%` (Windows). +fn home_dir() -> Option<PathBuf> { + std::env::var_os("HOME") + .or_else(|| std::env::var_os("USERPROFILE")) + .map(PathBuf::from) + .filter(|p| !p.as_os_str().is_empty()) +} + +/// Resolve a transform binary by name: `$PATH` first, then the standard install +/// locations (`~/.cargo/bin`, `~/.govbot/bin`). An absolute/relative path that +/// already exists is used as-is. This generalizes the old `find_fastclass()`. +fn resolve_transform_binary(name: &str) -> Option<PathBuf> { + // An explicit path component — use it directly if it resolves. + if name.contains('/') || name.contains('\\') { + let p = PathBuf::from(name); + return p.is_file().then_some(p); + } + + let exe = if cfg!(windows) && !name.ends_with(".exe") { + format!("{}.exe", name) + } else { + name.to_string() + }; + + if let Ok(path) = std::env::var("PATH") { + if let Some(hit) = std::env::split_paths(&path) + .map(|p| p.join(&exe)) + .find(|p| p.is_file()) + { + return Some(hit); + } + } + let home = home_dir()?; + [".cargo/bin", ".govbot/bin"] + .into_iter() + .map(|d| home.join(d).join(&exe)) + .find(|p| p.is_file()) +} + +/// Run the transform DAG: `govbot source --select docs | | | ... | +/// govbot apply`.
+/// +/// A **linear executor** — each transform is an external process speaking the +/// govbot stream protocol (newline-delimited JSON, `{id,text,kind}` in, +/// results out). Output of stage N is piped to the stdin of stage N+1. The +/// `transforms:`/`pipelines:` schema is DAG-capable; this runner walks it +/// linearly, which is sufficient for the single-classifier pipeline today. +/// +/// Returns `Ok(true)` when every stage exits successfully. +fn run_transform_dag( + govbot_bin: &Path, + transforms: &[(String, ResolvedTransform)], + cwd: &Path, + govbot_dir: Option<&str>, + source_repos: &[String], +) -> Result<bool> { + // Stage 0: the source — `govbot source --select docs`. Scope it to the + // manifest's declared datasets (an empty list means "every linked + // dataset", matching the standalone `govbot source` default). + let mut source_cmd = Command::new(govbot_bin); + source_cmd.arg("source").arg("--select").arg("docs"); + if let Some(d) = govbot_dir { + source_cmd.arg("--govbot-dir").arg(d); + } + if !source_repos.is_empty() { + source_cmd.arg("--repos"); + for d in source_repos { + source_cmd.arg(d); + } + } + let mut source_child = source_cmd .current_dir(cwd) .stdout(Stdio::piped()) .stderr(Stdio::inherit()) .spawn() - .context("Failed to spawn govbot logs")?; - - let logs_stdout = logs_child + .context("Failed to spawn govbot source")?; + let mut prev_stdout: Stdio = source_child .stdout .take() - .context("Failed to capture logs stdout")?; + .context("Failed to capture source stdout")? + .into(); + + // Each transform stage reads the previous stage's stdout.
+ let mut transform_children = Vec::new(); + for (name, t) in transforms { + let mut child = Command::new(&t.bin) + .args(&t.args) + .current_dir(cwd) + .stdin(prev_stdout) + .stdout(Stdio::piped()) + .stderr(Stdio::inherit()) + .spawn() + .with_context(|| format!("Failed to spawn transform '{}'", name))?; + prev_stdout = child + .stdout + .take() + .with_context(|| format!("Failed to capture stdout of transform '{}'", name))? + .into(); + transform_children.push(child); + } - let tag_child = Command::new(govbot_bin) - .arg("tag") + // The sink: `govbot apply` consumes the final transform's result stream. + let apply_child = Command::new(govbot_bin) + .arg("apply") .current_dir(cwd) - .stdin(logs_stdout) + .stdin(prev_stdout) .stdout(Stdio::inherit()) .stderr(Stdio::inherit()) .spawn() - .context("Failed to spawn govbot tag")?; + .context("Failed to spawn govbot apply")?; - let tag_output = tag_child + // Wait downstream-to-upstream so pipes drain. + let apply_output = apply_child .wait_with_output() - .context("Failed to wait for govbot tag")?; + .context("Failed to wait for govbot apply")?; + let mut all_ok = apply_output.status.success(); + let mut statuses: HashMap<String, bool> = HashMap::new(); + for (child, (name, _)) in transform_children.iter_mut().zip(transforms.iter()) { + let status = child + .wait() + .with_context(|| format!("Failed to wait for transform '{}'", name))?; + statuses.insert(name.clone(), status.success()); + all_ok &= status.success(); + } + let source_status = source_child + .wait() + .context("Failed to wait for govbot source")?; + all_ok &= source_status.success(); + + Ok(all_ok) +} - let logs_status = logs_child.wait().context("Failed to wait for govbot logs")?; +/// Map a manifest dataset id to the on-disk directory name under `repos/`.
+/// +/// A manifest id can be a bare jurisdiction code (`wy`) — which the registry +/// resolves to a `short_name`, then `repo_dir_name` suffixes (`wy-legislation`) +/// — or it can already match the on-disk dir name. We try the suffixed form +/// first; the raw id is the fallback for the (rare) namespaced-id case. +fn seed_dir_name(manifest_id: &str) -> String { + // Strip a `namespace/` prefix (`us-legislation/wy` -> `wy`) so the + // suffixed form matches `wy-legislation`. + let bare = manifest_id.rsplit('/').next().unwrap_or(manifest_id); + repo_dir_name(bare) +} - Ok(logs_status.success() && tag_output.status.success()) +/// True when `//` (or the raw name) exists and +/// has at least one entry. The directory walks `govbot source` does for the +/// dataset will succeed iff this is the case. +fn is_local_seed(repos_dir: &Path, manifest_id: &str) -> bool { + let candidate1 = repos_dir.join(seed_dir_name(manifest_id)); + let candidate2 = repos_dir.join(manifest_id); + [candidate1, candidate2] + .into_iter() + .any(|p| dir_has_entries(&p)) +} + +/// True when `p` is a directory (or a symlink resolving to one) with at least +/// one child entry. +fn dir_has_entries(p: &Path) -> bool { + std::fs::read_dir(p) + .map(|mut it| it.next().is_some()) + .unwrap_or(false) +} + +/// Translate the manifest's `datasets:` list to the `--repos` argv that +/// scopes the `govbot source` stage inside `run_transform_dag`. +/// +/// `datasets: [all]` becomes an empty list — `govbot source`'s own sentinel +/// for "every linked dataset", omitted from the argv so the flag is absent. +/// Any other list is passed through verbatim; `govbot source --repos ` +/// then walks only the named datasets. +/// +/// This is the load-bearing piece of [`run_pipeline`]'s step 2: forgetting +/// to pass `--repos` here caused a bug in which a manifest declaring +/// `datasets: [wy]` still classified ~4900 records across 52 states because +/// the cache held datasets from an earlier `[all]` pull. 
+fn source_repos_from_manifest(datasets: &[String]) -> Vec<String> { + if datasets == ["all"] { + Vec::new() + } else { + datasets.to_vec() + } +} + +#[cfg(test)] +mod tests { + use super::*; + + /// Regression test for the `datasets:[wy]` scope leak: an `[all]` + /// manifest must produce an empty argv list (so `--repos` is omitted — + /// `govbot source`'s sentinel for "every linked dataset"), but any other + /// list must pass through verbatim so `govbot source --repos ` + /// scopes the walk. Pre-fix, `run_transform_dag` never passed `--repos`, + /// so the manifest's `datasets:` was silently ignored at the source step + /// and a `[wy]` manifest still classified ~4900 records across 52 states. + #[test] + fn source_repos_from_manifest_translates_all_and_scopes() { + assert_eq!( + source_repos_from_manifest(&["all".to_string()]), + Vec::<String>::new(), + "`[all]` must collapse to empty so --repos is omitted" + ); + assert_eq!( + source_repos_from_manifest(&["wy".to_string()]), + vec!["wy".to_string()], + "`[wy]` must pass through verbatim" + ); + assert_eq!( + source_repos_from_manifest(&["wy".to_string(), "il".to_string()]), + vec!["wy".to_string(), "il".to_string()], + "`[wy, il]` must pass through verbatim" + ); + // An `[all, wy]` mix is not the `[all]` sentinel — pass through so + // the source step at least scopes to the named subset (and treats + // the literal `all` as a possibly-missing dataset id, surfacing the + // manifest error rather than silently widening to every dataset). + assert_eq!( + source_repos_from_manifest(&["all".to_string(), "wy".to_string()]), + vec!["all".to_string(), "wy".to_string()], + ); + } + + /// `govbot run` should detect a project-local dataset seed + /// (`.govbot/repos//`) and skip the cache-touching pull substep. + /// We test the detector — the substep skip itself is exercised by the + /// integration repro in the bug 3 PR description.
+ #[test] + fn is_local_seed_detects_populated_dir() { + let tmp = tempfile::tempdir().expect("tmpdir"); + let repos = tmp.path(); + // Empty repos/: no seed. + assert!(!is_local_seed(repos, "wy")); + + // Create the expected dataset dir with a file inside. + let seed = repos.join("wy-legislation"); + std::fs::create_dir_all(&seed).unwrap(); + std::fs::write(seed.join("data.json"), b"{}").unwrap(); + assert!(is_local_seed(repos, "wy")); + + // Namespaced id — still finds the suffixed dir. + assert!(is_local_seed(repos, "us-legislation/wy")); + } } diff --git a/actions/govbot/src/processor.rs b/actions/govbot/src/processor.rs index b2bd5052..4e46b1d0 100644 --- a/actions/govbot/src/processor.rs +++ b/actions/govbot/src/processor.rs @@ -1,10 +1,7 @@ use crate::config::Config; use crate::error::{Error, Result}; use crate::git; -use crate::types::{ - FileWithTimestamp, LogContent, LogEntry, Metadata, - VoteEventResult, -}; +use crate::types::{FileWithTimestamp, LogContent, LogEntry, Metadata, VoteEventResult}; use async_stream::stream; use futures::Stream; use jwalk::WalkDir; @@ -78,15 +75,27 @@ impl PipelineProcessor { config .repos .iter() - .map(|repo| search_dir.join(git::build_repo_name(repo))) + .map(|repo| { + // A dataset identifier may be namespaced; the clone + // directory is keyed on the short (slash-free) name. + let short = repo.rsplit('/').next().unwrap_or(repo); + let short = short.split('@').next().unwrap_or(short); + search_dir.join(git::repo_dir_name(short)) + }) .collect() }; for search_path in search_paths { if !search_path.exists() { - eprintln!("Warning: Expected repository directory does not exist: {}", search_path.display()); + eprintln!( + "Warning: Expected repository directory does not exist: {}", + search_path.display() + ); continue; } + // A project's repo entry may be a symlink into the shared dataset + // cache; jwalk reads through the root symlink transparently and + // reports child paths under `search_dir`, keeping them relative. 
// Use jwalk for fast parallel traversal // jwalk uses rayon internally for parallel processing @@ -204,29 +213,37 @@ impl PipelineProcessor { /// Calculate relative path from search directory fn calculate_relative_path(path: &Path, search_dir: &Path) -> Result { let search_dir_abs = search_dir.canonicalize().map_err(|_| { - Error::Path(format!("Failed to canonicalize search directory: {}", search_dir.display())) + Error::Path(format!( + "Failed to canonicalize search directory: {}", + search_dir.display() + )) })?; - - let path_abs = path.parent() - .ok_or_else(|| Error::Path(format!("Failed to get parent of path: {}", path.display())))? + + let path_abs = path + .parent() + .ok_or_else(|| { + Error::Path(format!("Failed to get parent of path: {}", path.display())) + })? .canonicalize() - .map_err(|_| { - Error::Path(format!("Failed to canonicalize path: {}", path.display())) - })?; + .map_err(|_| Error::Path(format!("Failed to canonicalize path: {}", path.display())))?; let relative = pathdiff::diff_paths(&path_abs, &search_dir_abs) .ok_or_else(|| Error::Path("Failed to calculate relative path".to_string()))?; // Reconstruct the full relative path including the filename - let filename = path.file_name() + let filename = path + .file_name() .ok_or_else(|| Error::Path(format!("Failed to get filename: {}", path.display())))?; - + Ok(relative.join(filename).to_string_lossy().to_string()) } /// Sort files by timestamp according to sort order /// Uses relative_path as a secondary sort key to ensure deterministic ordering - fn sort_files_internal(config: &Config, mut files: Vec) -> Vec { + fn sort_files_internal( + config: &Config, + mut files: Vec, + ) -> Vec { match config.sort_order { crate::config::SortOrder::Descending => { files.sort_by(|a, b| { @@ -271,7 +288,10 @@ impl PipelineProcessor { } /// Apply limit to files - fn apply_limit_internal(config: &Config, files: Vec) -> Vec { + fn apply_limit_internal( + config: &Config, + files: Vec, + ) -> Vec { if let 
Some(limit) = config.limit { files.into_iter().take(limit).collect() } else { @@ -280,7 +300,10 @@ impl PipelineProcessor { } /// Process a single file and return a log entry - async fn process_file_internal(config: &Config, file: &FileWithTimestamp) -> Result> { + async fn process_file_internal( + config: &Config, + file: &FileWithTimestamp, + ) -> Result> { // Check if it's a vote event file let is_vote_event = file.relative_path.contains(".vote_event."); @@ -292,7 +315,10 @@ impl PipelineProcessor { } /// Process a vote event file - async fn process_vote_event_file_internal(_config: &Config, file: &FileWithTimestamp) -> Result> { + async fn process_vote_event_file_internal( + _config: &Config, + file: &FileWithTimestamp, + ) -> Result> { // Extract vote event result from filename let vote_event_regex = Regex::new(r"\.vote_event\.([^.]+)\.")?; let result = vote_event_regex @@ -312,7 +338,10 @@ impl PipelineProcessor { } /// Process a regular (non-vote-event) file - async fn process_regular_file_internal(_config: &Config, file: &FileWithTimestamp) -> Result> { + async fn process_regular_file_internal( + _config: &Config, + file: &FileWithTimestamp, + ) -> Result> { // Read and parse JSON content let json_content = tokio::fs::read_to_string(&file.path).await?; let log_value: serde_json::Value = serde_json::from_str(&json_content)?; @@ -358,4 +387,3 @@ impl PipelineProcessor { Ok(Some(metadata)) } } - diff --git a/actions/govbot/src/publish.rs b/actions/govbot/src/publish.rs index 6f9edb31..8637885a 100644 --- a/actions/govbot/src/publish.rs +++ b/actions/govbot/src/publish.rs @@ -1,31 +1,252 @@ +use crate::config::{Manifest, Publisher, PublisherKind}; use crate::rss; use anyhow::{Context, Result}; use serde_json::Value; use std::collections::HashSet; use std::fs; -use std::path::Path; - -/// Load and parse govbot.yml configuration -pub fn load_config(config_path: &Path) -> Result { - let contents = fs::read_to_string(config_path) - .with_context(|| format!("Failed 
to read config file: {}", config_path.display()))?; - serde_yaml::from_str(&contents) - .with_context(|| format!("Failed to parse YAML: {}", config_path.display())) +use std::path::{Path, PathBuf}; + +/// Load and parse the `govbot.yml` manifest (datasets / transforms / publish / +/// pipelines). A manifest carrying the retired `tags:` block fails to parse. +pub fn load_manifest(config_path: &Path) -> Result { + Manifest::load(config_path) +} + +/// A resolved publishing job: a publisher definition plus the result stream +/// (already filtered, deduplicated, sorted, and limited) it should emit. +pub struct PublishJob<'a> { + /// The publisher name from `govbot.yml: publish:`. + pub name: &'a str, + /// The typed publisher definition. + pub publisher: &'a Publisher, + /// The records to publish — the result stream this publisher consumes. + pub entries: Vec, + /// Output directory override (CLI `--output-dir`). + pub output_dir_override: Option, + /// Output filename override (CLI `--output-file`). + pub output_file_override: Option, + /// The project directory (where `govbot.yml` lives). Stateful publishers + /// (e.g. `bluesky`'s posted-state ledger) resolve relative paths here. + pub project_dir: PathBuf, + /// `--dry-run`: render but do not emit. The `bluesky` publisher honours + /// this by touching no network and no ledger. + pub dry_run: bool, + /// The companion `html` publisher's public landing-page URL, if the + /// manifest declares one (e.g. `https://example.org/climate-tracker`). + /// The `bluesky` publisher uses this as the default for `{link}` so a + /// post links to the *human-readable* HTML index — not the raw + /// `metadata.json` path that the rss/html publishers' `extract_link` + /// emits by default. None when the manifest has no `html` publisher. + pub html_entry_url: Option, +} + +/// Run a single publisher against its result stream and emit artifacts. 
+/// +/// **One publisher type, one artifact.** Each built-in publisher writes +/// exactly one kind of file: +/// +/// - `type: rss` writes the RSS feed (default `feed.xml`); +/// - `type: html` writes the HTML index (default `index.html`); +/// - `type: json` writes a JSON dump; +/// - `type: duckdb` loads records into a DuckDB database; +/// - `type: bluesky` posts matched bills to a Bluesky account. +/// +/// Before this split, `rss` and `html` each emitted *both* a feed.xml and +/// an index.html — declaring both in one manifest produced a silent +/// last-writer-wins collision on `index.html`. Declare both explicitly to +/// get both artifacts. +pub fn run_publisher(job: &PublishJob) -> Result<()> { + let p = job.publisher; + let select = p.select.clone().unwrap_or_default(); + + let output_dir = PathBuf::from( + job.output_dir_override + .clone() + .or_else(|| p.output_dir.clone()) + .unwrap_or_else(|| "docs".to_string()), + ); + + match p.kind { + PublisherKind::Rss => emit_rss(job, &select, &output_dir), + PublisherKind::Html => emit_html(job, &output_dir), + PublisherKind::Json => emit_json(job, &output_dir), + PublisherKind::Duckdb => emit_duckdb(job, &output_dir), + PublisherKind::Bluesky => crate::bluesky::run_bluesky(job, job.dry_run), + } } -/// Get repos list from config, handling 'all' special case -pub fn get_repos_from_config(config: &Value) -> Vec { - if let Some(repos) = config.get("repos") { - if let Some(arr) = repos.as_array() { - return arr - .iter() - .filter_map(|v| v.as_str().map(|s| s.to_string())) - .collect(); - } else if let Some(s) = repos.as_str() { - return vec![s.to_string()]; +/// Title-case a tag name (`clean_energy` -> `Clean Energy`). 
+fn titlecase_tag(tag: &str) -> String { + tag.replace('_', " ") + .split_whitespace() + .map(|w| { + let mut chars = w.chars(); + match chars.next() { + None => String::new(), + Some(f) => f.to_uppercase().collect::() + chars.as_str(), + } + }) + .collect::>() + .join(" ") +} + +/// The `rss` publisher: emits the RSS feed (and *only* the RSS feed). +/// +/// Default output: `/feed.xml`. Pair with a peer `type: html` +/// publisher to also get an `index.html`. Before this split, `rss` also +/// wrote `index.html` — which collided with the `html` publisher's +/// `index.html` and made the rendering last-writer-wins. +fn emit_rss(job: &PublishJob, select: &[String], output_dir: &Path) -> Result<()> { + let p = job.publisher; + + let output_file = job + .output_file_override + .clone() + .or_else(|| p.output_file.clone()) + .unwrap_or_else(|| "feed.xml".to_string()); + + let feed_link = p.base_url.as_deref().unwrap_or("https://example.com"); + + // Auto-derive a title from the selected tags when none is configured. + let feed_title = p.title.clone().unwrap_or_else(|| { + if select.is_empty() { + "Legislation".to_string() + } else { + format!( + "{} Legislation", + select + .iter() + .map(|t| titlecase_tag(t)) + .collect::>() + .join(" & ") + ) + } + }); + + // The auto-description previously read each tag's `description` from + // `govbot.yml`; that taxonomy data now lives in the fastclass bundle, not + // here. Fall back to a simple tag-name-derived description. 
+ let feed_description = p.description.clone().unwrap_or_else(|| { + if select.is_empty() { + "Legislative updates".to_string() + } else { + format!( + "Legislative updates tagged {}", + select + .iter() + .map(|t| titlecase_tag(t)) + .collect::>() + .join(", ") + ) } + }); + + fs::create_dir_all(output_dir) + .with_context(|| format!("Failed to create output dir: {}", output_dir.display()))?; + + eprintln!("Generating RSS feed with {} entries...", job.entries.len()); + let rss_xml = rss::json_to_rss( + job.entries.clone(), + &feed_title, + &feed_description, + feed_link, + Some(feed_link), + "en-us", + ); + let rss_path = output_dir.join(&output_file); + fs::write(&rss_path, rss_xml)?; + eprintln!("✓ Generated RSS feed: {}", rss_path.display()); + Ok(()) +} + +/// The `html` publisher: emits the HTML index (and *only* the HTML index). +/// +/// Default output: `/index.html`. Pair with a peer `type: rss` +/// publisher to also get an RSS feed. Before this split, `html` also wrote +/// a `feed.xml` — which collided with the `rss` publisher's `feed.xml`. +fn emit_html(job: &PublishJob, output_dir: &Path) -> Result<()> { + let p = job.publisher; + + let feed_link = p.base_url.as_deref().unwrap_or("https://example.com"); + + fs::create_dir_all(output_dir) + .with_context(|| format!("Failed to create output dir: {}", output_dir.display()))?; + + eprintln!( + "Generating HTML index with {} entries...", + job.entries.len() + ); + let output_file = job + .output_file_override + .clone() + .or_else(|| p.output_file.clone()) + .unwrap_or_else(|| "index.html".to_string()); + + // Only pass an explicit (configured) title to the HTML header. 
+ let html_title = p.title.as_deref().filter(|s| !s.trim().is_empty()); + let html = rss::json_to_html(job.entries.clone(), html_title, feed_link, Some(feed_link)); + let html_path = output_dir.join(&output_file); + fs::write(&html_path, html)?; + eprintln!("✓ Generated HTML index: {}", html_path.display()); + Ok(()) +} + +/// The `json` publisher: a JSON dump of the result stream. +fn emit_json(job: &PublishJob, output_dir: &Path) -> Result<()> { + let output_file = job + .output_file_override + .clone() + .or_else(|| job.publisher.output_file.clone()) + .unwrap_or_else(|| "feed.json".to_string()); + + fs::create_dir_all(output_dir) + .with_context(|| format!("Failed to create output dir: {}", output_dir.display()))?; + let path = output_dir.join(&output_file); + fs::write(&path, serde_json::to_string_pretty(&job.entries)?)?; + eprintln!( + "✓ Generated JSON dump ({} entries): {}", + job.entries.len(), + path.display() + ); + Ok(()) +} + +/// The `duckdb` publisher: load the result stream into a DuckDB database by +/// writing the records to a JSON file and `read_json_auto`-ing them. 
+fn emit_duckdb(job: &PublishJob, output_dir: &Path) -> Result<()> { + use std::process::Command; + + let db_file = job + .output_file_override + .clone() + .or_else(|| job.publisher.output_file.clone()) + .unwrap_or_else(|| "feed.duckdb".to_string()); + + fs::create_dir_all(output_dir) + .with_context(|| format!("Failed to create output dir: {}", output_dir.display()))?; + let db_path = output_dir.join(&db_file); + let json_path = output_dir.join(format!("{}.records.json", job.name)); + fs::write(&json_path, serde_json::to_string(&job.entries)?)?; + + let sql = format!( + "CREATE OR REPLACE TABLE records AS SELECT * FROM read_json_auto('{}');", + json_path.display() + ); + let status = Command::new("duckdb") + .arg(db_path.to_string_lossy().as_ref()) + .arg("-c") + .arg(&sql) + .status() + .context("Failed to run `duckdb` — is the DuckDB CLI installed?")?; + if !status.success() { + anyhow::bail!("duckdb publisher '{}' failed", job.name); } - vec!["all".to_string()] + eprintln!( + "✓ Loaded {} entries into DuckDB: {}", + job.entries.len(), + db_path.display() + ); + Ok(()) } /// Filter entries by tags @@ -63,15 +284,28 @@ pub fn filter_by_tags(entry: &Value, tag_names: &[String]) -> bool { false } -/// Deduplicate entries by GUID +/// Deduplicate entries by **bill** (jurisdiction, bill_id) — collapse the +/// N action-log records a bill emits to a single representative, keeping +/// the **first** in the stream. +/// +/// Callers sort by timestamp DESC before this, so the first-per-bill wins +/// is also the **most recent action log**. The post / feed item / +/// HTML entry is rendered from that representative. +/// +/// Before this fix, this function dedup'd by per-log GUID — i.e. it +/// **did not collapse multiple logs for the same bill**, which let an +/// activist see the same bill posted six times in a row (NV AB1 +/// 6×, AK HB53 4× on the climate-tracker feed). The bill_guid is the +/// canonical bill path (`/.../bills/`); see +/// [`rss::bill_guid`]. 
pub fn deduplicate_entries(entries: Vec) -> Vec { let mut seen = HashSet::new(); let mut result = Vec::new(); for entry in entries { - let guid = rss::extract_guid(&entry); - if !seen.contains(&guid) { - seen.insert(guid); + let bill_key = rss::bill_guid(&entry); + if !seen.contains(&bill_key) { + seen.insert(bill_key); result.push(entry); } } @@ -88,3 +322,337 @@ pub fn sort_by_timestamp(mut entries: Vec) -> Vec { }); entries } + +#[cfg(test)] +mod tests { + use super::*; + use serde_json::json; + use tempfile::tempdir; + + /// Build a `PublishJob` over a tempdir for the given publisher kind. + fn job_for_kind<'a>( + name: &'a str, + publisher: &'a Publisher, + project_dir: PathBuf, + ) -> PublishJob<'a> { + PublishJob { + name, + publisher, + entries: vec![json!({ + "id": "wy-legislation/.../HB0001", + "timestamp": "20250101T000000Z", + "bill": { "title": "Sample bill", "identifier": "HB0001" }, + "sources": { "bill": "wy-legislation/.../HB0001/metadata.json" }, + "tags": { "clean_energy": { "final_score": 0.9 } }, + })], + output_dir_override: None, + output_file_override: None, + project_dir, + dry_run: false, + html_entry_url: None, + } + } + + /// Bug 8 regression: `type: rss` writes ONLY the RSS feed, not an + /// HTML index. Before the split, this publisher kind also produced + /// `index.html`, colliding with the html publisher's `index.html`. 
+ #[test] + fn rss_publisher_writes_only_feed_xml() { + let dir = tempdir().unwrap(); + let p = Publisher { + kind: PublisherKind::Rss, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(dir.path().join("out").to_string_lossy().to_string()), + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + let job = job_for_kind("feed", &p, dir.path().to_path_buf()); + run_publisher(&job).expect("rss publisher should run"); + + let out = dir.path().join("out"); + assert!(out.join("feed.xml").exists(), "expected feed.xml"); + assert!( + !out.join("index.html").exists(), + "rss publisher must NOT emit index.html — that's the html publisher's job" + ); + } + + /// Bug 8 regression: `type: html` writes ONLY the HTML index, not an + /// RSS feed. Before the split, this publisher kind also produced + /// `feed.xml`, colliding with the rss publisher's `feed.xml`. + #[test] + fn html_publisher_writes_only_index_html() { + let dir = tempdir().unwrap(); + let p = Publisher { + kind: PublisherKind::Html, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(dir.path().join("out").to_string_lossy().to_string()), + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + let job = job_for_kind("site", &p, dir.path().to_path_buf()); + run_publisher(&job).expect("html publisher should run"); + + let out = dir.path().join("out"); + assert!(out.join("index.html").exists(), "expected index.html"); + assert!( + !out.join("feed.xml").exists(), + "html publisher must NOT emit feed.xml — that's the rss publisher's job" + ); + } + + /// Declaring both `rss` and `html` publishers into the SAME output_dir + /// produces both artifacts side-by-side. 
Before the split, running + /// `rss` then `html` (or vice versa) produced a silent + /// last-writer-wins collision on `index.html`. + #[test] + fn rss_and_html_publishers_coexist_in_one_output_dir() { + let dir = tempdir().unwrap(); + let out_dir = dir.path().join("out"); + + let rss = Publisher { + kind: PublisherKind::Rss, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(out_dir.to_string_lossy().to_string()), + output_file: None, + title: Some("RSS publisher title".to_string()), + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + let html = Publisher { + kind: PublisherKind::Html, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(out_dir.to_string_lossy().to_string()), + output_file: None, + title: Some("HTML publisher title".to_string()), + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + + let job_rss = job_for_kind("feed", &rss, dir.path().to_path_buf()); + run_publisher(&job_rss).unwrap(); + let job_html = job_for_kind("site", &html, dir.path().to_path_buf()); + run_publisher(&job_html).unwrap(); + + let feed_xml = std::fs::read_to_string(out_dir.join("feed.xml")).unwrap(); + let index_html = std::fs::read_to_string(out_dir.join("index.html")).unwrap(); + // Each publisher's own title must be in its own artifact — proves + // neither publisher overwrote the other's output. 
+ assert!( + feed_xml.contains("RSS publisher title"), + "feed.xml should carry the rss publisher's title" + ); + assert!( + index_html.contains("HTML publisher title"), + "index.html should carry the html publisher's title (not the rss publisher's)" + ); + } + + // ------------------------------------------------------------ + // Per-bill dedup regression tests (same Bug as the bluesky one) + // ------------------------------------------------------------ + + /// Build a synthetic log entry for `bill_id` whose `sources.log` + /// embeds `filename` — the shape `govbot source --join bill,tags` + /// emits. The `timestamp` is included so the upstream `sort_by_timestamp` + /// is exercised the way `run_publish_command` exercises it. + fn log(dataset: &str, session: &str, bill_id: &str, filename: &str, ts: &str) -> Value { + json!({ + "id": bill_id, + "timestamp": ts, + "bill": { "title": format!("Bill {}", bill_id), "identifier": bill_id }, + "log": { "bill_id": bill_id }, + "sources": { + "log": format!( + "{}/country:us/state:xx/sessions/{}/logs/{}", + dataset, session, filename + ) + }, + "tags": { "clean_energy": { "final_score": 0.9 } } + }) + } + + /// Six action-log entries for the same NV AB1 bill must collapse to + /// **one** entry post-dedup — the bug that put 6 NV AB1 posts on the + /// climate-tracker bluesky-pending feed under `datasets: [all]`. RSS + /// and HTML feeds share the same dedup (`deduplicate_entries`). 
+ #[test] + fn deduplicate_entries_collapses_action_logs_to_one_per_bill() { + let entries: Vec<Value> = (1..=6) + .map(|i| { + log( + "nv-legislation", + "2025Special36", + "AB1", + &format!("2025111{}T080000Z.classification.referral.json", i), + &format!("2025111{}T080000Z", i), + ) + }) + .collect(); + + let out = deduplicate_entries(entries); + assert_eq!( + out.len(), + 1, + "6 action logs for the same bill must dedup to 1; got {}", + out.len() + ); + } + + /// The dedup keeps **distinct bills** distinct — only logs *for the + /// same bill* are collapsed. A second bill (NV AB2) survives the same + /// dedup pass. + #[test] + fn deduplicate_entries_keeps_distinct_bills() { + let entries = vec![ + log( + "nv-legislation", + "2025Special36", + "AB1", + "20251111T080000Z.a.json", + "20251111T080000Z", + ), + log( + "nv-legislation", + "2025Special36", + "AB1", + "20251112T080000Z.b.json", + "20251112T080000Z", + ), + log( + "nv-legislation", + "2025Special36", + "AB2", + "20251111T080000Z.c.json", + "20251111T080000Z", + ), + ]; + let out = deduplicate_entries(entries); + assert_eq!(out.len(), 2, "AB1 collapses to 1 record; AB2 survives"); + } + + /// The `rss` publisher emits ONE `<item>` per bill — not one per + /// action log. End-to-end check: render an RSS feed from 6 action-log + /// records for the same bill and count `<item>` tags. + #[test] + fn rss_publisher_emits_one_item_per_bill_even_with_multiple_action_logs() { + let dir = tempdir().unwrap(); + let out_dir = dir.path().join("out"); + let p = Publisher { + kind: PublisherKind::Rss, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(out_dir.to_string_lossy().to_string()), + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + // Six action logs for NV AB1.
+ let entries: Vec<Value> = (1..=6) + .map(|i| { + log( + "nv-legislation", + "2025Special36", + "AB1", + &format!("2025111{}T080000Z.classification.referral.json", i), + &format!("2025111{}T080000Z", i), + ) + }) + .collect(); + let job = PublishJob { + name: "feed", + publisher: &p, + entries, + output_dir_override: None, + output_file_override: None, + project_dir: dir.path().to_path_buf(), + dry_run: false, + html_entry_url: None, + }; + run_publisher(&job).expect("rss publisher should run"); + + let feed_xml = std::fs::read_to_string(out_dir.join("feed.xml")).unwrap(); + let item_count = feed_xml.matches("<item>").count(); + assert_eq!( + item_count, 1, + "RSS feed must contain exactly one <item> per bill; got {} items for one bill", + item_count + ); + } + + /// The `html` publisher emits ONE `<article>` per bill — not one per + /// action log. End-to-end check: render the HTML index from 6 + /// action-log records for the same bill and count `<article>` tags. + #[test] + fn html_publisher_emits_one_article_per_bill_even_with_multiple_action_logs() { + let dir = tempdir().unwrap(); + let out_dir = dir.path().join("out"); + let p = Publisher { + kind: PublisherKind::Html, + select: None, + base_url: Some("https://example.org/test".to_string()), + output_dir: Some(out_dir.to_string_lossy().to_string()), + output_file: None, + title: None, + description: None, + limit: None, + min_score: None, + ledger: None, + post_template: None, + }; + let entries: Vec<Value> = (1..=6) + .map(|i| { + log( + "nv-legislation", + "2025Special36", + "AB1", + &format!("2025111{}T080000Z.classification.referral.json", i), + &format!("2025111{}T080000Z", i), + ) + }) + .collect(); + let job = PublishJob { + name: "site", + publisher: &p, + entries, + output_dir_override: None, + output_file_override: None, + project_dir: dir.path().to_path_buf(), + dry_run: false, + html_entry_url: None, + }; + run_publisher(&job).expect("html publisher should run"); + + let html = std::fs::read_to_string(out_dir.join("index.html")).unwrap(); + let article_count = html.matches(" per bill; got {} for one bill", + article_count + ); + } +} diff --git a/actions/govbot/src/registry.rs b/actions/govbot/src/registry.rs new file mode 100644 index 00000000..f5161e81 --- /dev/null +++ b/actions/govbot/src/registry.rs @@ -0,0 +1,330 @@ +//! The govbot dataset registry — "npm/docker for government data." +//! +//! A registry maps a **dataset identifier** to the git repo holding its data, +//! the data schema it follows, and the glob that locates records within the +//! repo. Datasets are git repos; this index is what lets govbot resolve a +//! dataset at runtime instead of from a compiled enum. +//! +//! ## Identifier scheme +//! +//! A canonical identifier is `namespace/name[@channel]`: +//! - `namespace` — a grouping (`us-legislation`, a county set, an agency set). +//! - `name` — the dataset within the namespace (`wy`, `il`, …). +//!
- `@channel` — an optional release channel / branch (defaults to the +//! repo's default branch). +//! +//! **Plain jurisdiction codes stay valid.** A bare identifier with no `/` +//! (e.g. `wy`) is resolved against the registry's `default_namespace`, so an +//! existing manifest `datasets: [wy]` keeps working unchanged. `all` is a +//! reserved alias meaning "every dataset in the registry." +//! +//! ## Where it lives / how it is fetched +//! +//! The default registry is the JSON file `data/registry.json`, **compiled into +//! the binary** via `include_str!` — so a fresh install resolves the seed +//! jurisdictions with zero network access. A project can override it: +//! 1. `GOVBOT_REGISTRY_URL` — an `http(s)://` URL or a local file path. +//! 2. `/.govbot/registry.json` — a project-local registry file. +//! A fetched registry is cached at `~/.govbot/registry.json`. +//! +//! See `actions/govbot/REGISTRY.md` for the full format documentation. + +use crate::error::{Error, Result}; +use serde::{Deserialize, Serialize}; +use std::collections::BTreeMap; +use std::path::PathBuf; + +/// The bundled default registry, compiled into the binary. +const BUNDLED_REGISTRY: &str = include_str!("../data/registry.json"); + +/// A single dataset entry: where its data lives and how to read it. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct DatasetEntry { + /// The git repository URL the dataset's data is cloned from. + pub git_url: String, + + /// The data schema the dataset follows (e.g. `ocdfiles`). Informational — + /// it lets a future `source` projection pick the right reader. + #[serde(default)] + pub schema: Option, + + /// A glob, relative to the cloned repo root, that locates the dataset's + /// records. Replaces the hard-coded `**/logs/*.json` walk. + #[serde(default)] + pub path_pattern: Option, + + /// A human-readable display name (`Wyoming`, `Cook County`, …). + #[serde(default)] + pub name: Option, +} + +/// The parsed registry file. 
+#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct Registry { + /// Registry format version, for forward-compatibility. + #[serde(default, rename = "$schema_version")] + pub schema_version: Option, + + /// Free-text description of this registry. + #[serde(default)] + pub description: Option, + + /// The namespace a bare (slash-free) identifier is resolved against. + #[serde(default = "default_namespace")] + pub default_namespace: String, + + /// Canonical `namespace/name` → entry map. + pub datasets: BTreeMap, +} + +fn default_namespace() -> String { + "us-legislation".to_string() +} + +/// A resolved dataset: its canonical id plus the entry it points at. +#[derive(Debug, Clone)] +pub struct ResolvedDataset { + /// The canonical `namespace/name` identifier (channel stripped). + pub id: String, + /// The optional channel (branch) requested via `@channel`. + pub channel: Option, + /// The registry entry. + pub entry: DatasetEntry, +} + +impl ResolvedDataset { + /// The short, slash-free name a clone directory is keyed on (`wy`, `il`). + /// Strips the namespace; this is also the legacy "locale" string. + pub fn short_name(&self) -> &str { + self.id.rsplit('/').next().unwrap_or(&self.id) + } +} + +impl Registry { + /// Parse the bundled default registry. Infallible in practice — the file + /// is validated at build time — but surfaces a `Config` error if not. + pub fn bundled() -> Result { + serde_json::from_str(BUNDLED_REGISTRY) + .map_err(|e| Error::Config(format!("Bundled registry is invalid: {}", e))) + } + + /// Load the active registry, honoring overrides in priority order: + /// 1. `GOVBOT_REGISTRY_URL` (an `http(s)://` URL or a filesystem path). + /// 2. `/.govbot/registry.json` — a project-local registry. + /// 3. The bundled default. + /// + /// `project_dir` is the directory holding `govbot.yml` (or the cwd). 
+ pub fn load(project_dir: &std::path::Path) -> Result { + if let Ok(src) = std::env::var("GOVBOT_REGISTRY_URL") { + if !src.trim().is_empty() { + return Registry::from_source(&src); + } + } + let project_registry = project_dir.join(".govbot").join("registry.json"); + if project_registry.is_file() { + return Registry::from_file(&project_registry); + } + Registry::bundled() + } + + /// Load a registry from a source string: an `http(s)://` URL is fetched + /// (and cached at `~/.govbot/registry.json`), anything else is a file path. + pub fn from_source(src: &str) -> Result { + if src.starts_with("http://") || src.starts_with("https://") { + Registry::fetch(src) + } else { + Registry::from_file(std::path::Path::new(src)) + } + } + + /// Parse a registry from a JSON file on disk. + pub fn from_file(path: &std::path::Path) -> Result { + let contents = std::fs::read_to_string(path).map_err(|e| { + Error::Config(format!("Failed to read registry {}: {}", path.display(), e)) + })?; + serde_json::from_str(&contents) + .map_err(|e| Error::Config(format!("Invalid registry {}: {}", path.display(), e))) + } + + /// Fetch a registry over HTTP and cache it at `~/.govbot/registry.json`. + pub fn fetch(url: &str) -> Result { + let body = ureq::get(url) + .call() + .map_err(|e| Error::Config(format!("Failed to fetch registry {}: {}", url, e)))? + .into_body() + .read_to_string() + .map_err(|e| Error::Config(format!("Failed to read registry body: {}", e)))?; + let registry: Registry = serde_json::from_str(&body) + .map_err(|e| Error::Config(format!("Fetched registry {} is invalid: {}", url, e)))?; + // Best-effort cache write — a failure here is non-fatal. + if let Some(cache) = registry_cache_path() { + if let Some(parent) = cache.parent() { + let _ = std::fs::create_dir_all(parent); + } + let _ = std::fs::write(&cache, &body); + } + Ok(registry) + } + + /// Canonicalize a dataset identifier to `namespace/name` (channel stripped). 
+ /// + /// `wy` → `/wy`; `us-counties/cook` is returned as-is; + /// `wy@nightly` → `/wy`. The channel is returned + /// separately by [`Registry::resolve`]. + pub fn canonical_id(&self, identifier: &str) -> (String, Option<String>) { + let (base, channel) = match identifier.split_once('@') { + Some((b, c)) => (b, Some(c.to_string())), + None => (identifier, None), + }; + let id = if base.contains('/') { + base.to_string() + } else { + format!("{}/{}", self.default_namespace, base) + }; + (id, channel) + } + + /// Resolve a dataset identifier to its registry entry. + /// + /// Accepts a canonical `namespace/name[@channel]` id or a bare jurisdiction + /// code (resolved against `default_namespace`). Returns a `Config` error if + /// the identifier is not in the registry. + pub fn resolve(&self, identifier: &str) -> Result<ResolvedDataset> { + let (id, channel) = self.canonical_id(identifier); + let entry = self.datasets.get(&id).ok_or_else(|| { + Error::Config(format!( + "Unknown dataset '{}'. It is not in the registry. \ + Run `govbot search` to list available datasets.", + identifier + )) + })?; + Ok(ResolvedDataset { + id, + channel, + entry: entry.clone(), + }) + } + + /// Resolve a list of identifiers, expanding the `all` alias to every + /// dataset in the registry. Order is preserved; `all` expands in + /// canonical (sorted) order. + pub fn resolve_all(&self, identifiers: &[String]) -> Result<Vec<ResolvedDataset>> { + let mut out = Vec::new(); + for ident in identifiers { + let ident = ident.trim(); + if ident.is_empty() { + continue; + } + if ident.eq_ignore_ascii_case("all") { + for id in self.datasets.keys() { + out.push(self.resolve(id)?); + } + } else { + out.push(self.resolve(ident)?); + } + } + Ok(out) + } + + /// Every dataset in the registry, in canonical id order. + pub fn all(&self) -> Vec<ResolvedDataset> { + self.datasets + .iter() + .map(|(id, entry)| ResolvedDataset { + id: id.clone(), + channel: None, + entry: entry.clone(), + }) + .collect() + } + + /// Search the registry.
A blank query matches everything; otherwise the + /// query is matched case-insensitively against the id and the name. + pub fn search(&self, query: &str) -> Vec<ResolvedDataset> { + let q = query.trim().to_lowercase(); + self.all() + .into_iter() + .filter(|d| { + if q.is_empty() { + return true; + } + d.id.to_lowercase().contains(&q) + || d.entry + .name + .as_deref() + .map(|n| n.to_lowercase().contains(&q)) + .unwrap_or(false) + }) + .collect() + } +} + +/// The path the most recently fetched registry is cached at: +/// `~/.govbot/registry.json`. +pub fn registry_cache_path() -> Option<PathBuf> { + crate::cache::govbot_home().map(|h| h.join("registry.json")) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn bundled_registry_parses_and_has_seed_jurisdictions() { + let reg = Registry::bundled().expect("bundled registry must parse"); + // The seed catalog covers every US state legislature + DC + the + // territories + federal Congress. Asserting the floor (not an exact + // count) keeps the test stable when datasets are added.
+ assert!( + reg.datasets.len() >= 55, + "expected the seed catalog (>= 55 datasets); got {}", + reg.datasets.len() + ); + assert!(reg.datasets.contains_key("us-legislation/wy")); + } + + #[test] + fn bare_code_resolves_via_default_namespace() { + let reg = Registry::bundled().unwrap(); + let d = reg.resolve("wy").expect("`wy` must resolve"); + assert_eq!(d.id, "us-legislation/wy"); + assert_eq!(d.short_name(), "wy"); + assert!(d.entry.git_url.contains("wy-legislation")); + } + + #[test] + fn canonical_id_and_channel_split() { + let reg = Registry::bundled().unwrap(); + let d = reg.resolve("wy@nightly").unwrap(); + assert_eq!(d.id, "us-legislation/wy"); + assert_eq!(d.channel.as_deref(), Some("nightly")); + } + + #[test] + fn namespaced_id_resolves_directly() { + let reg = Registry::bundled().unwrap(); + let d = reg.resolve("us-legislation/il").unwrap(); + assert_eq!(d.id, "us-legislation/il"); + } + + #[test] + fn unknown_dataset_errors() { + let reg = Registry::bundled().unwrap(); + assert!(reg.resolve("atlantis").is_err()); + } + + #[test] + fn all_alias_expands_to_every_dataset() { + let reg = Registry::bundled().unwrap(); + let resolved = reg.resolve_all(&["all".to_string()]).unwrap(); + assert_eq!(resolved.len(), reg.datasets.len()); + } + + #[test] + fn search_matches_id_and_name() { + let reg = Registry::bundled().unwrap(); + assert!(!reg.search("wyoming").is_empty()); + assert!(!reg.search("wy").is_empty()); + assert_eq!(reg.search("").len(), reg.datasets.len()); + } +} diff --git a/actions/govbot/src/rss.rs b/actions/govbot/src/rss.rs index fdf55d1e..398f5518 100644 --- a/actions/govbot/src/rss.rs +++ b/actions/govbot/src/rss.rs @@ -275,7 +275,12 @@ pub fn extract_link(entry: &Value, base_url: Option<&str>) -> Option { None } -/// Extract or generate a unique GUID for the entry +/// Extract or generate a unique GUID for the entry. +/// +/// This is a **per-log** GUID — distinct for every action-log file. 
For a +/// **per-bill** key (the one publishers should use to dedup so an activist +/// doesn't see the same bill posted six times under six different action +/// logs), see [`bill_guid`]. pub fn extract_guid(entry: &Value) -> String { // Use source log path as GUID if available if let Some(sources) = entry.get("sources").and_then(|s| s.as_object()) { @@ -298,6 +303,84 @@ pub fn extract_guid(entry: &Value) -> String { format!("{}_{}", timestamp, bill_id) } +/// A **per-bill** GUID — the grouping key publishers use to emit one post / +/// item / row per (jurisdiction, bill_id) rather than one per action log. +/// +/// Publishers (bluesky, rss, html) receive a result stream where the same +/// bill emits one record per action-log file (committee referrals, hearings, +/// passage votes, …). Activists want one item per bill, not N. The +/// **bill_guid** collapses the N action-log records to a single (jurisdiction, +/// bill_id) key. +/// +/// Resolution order — each path produces the same canonical form +/// `/country:.../state:.../sessions//bills/`: +/// +/// 1. `sources.log` → strip `/logs/` tail; if the stripped path +/// already ends in `/bills/` (per-bill-log-directory layout) use it +/// verbatim, else append `/bills/` (session-level-log-directory +/// layout — the OCD-files common case). +/// 2. `sources.bill` → strip `/metadata.json` tail. The parent dir IS the +/// bill dir on disk. +/// 3. fall back to the per-log GUID (see [`extract_guid`]) — preserves the +/// pre-fix shape when an entry carries no `sources` block at all. +/// +/// `bill_id` is taken from `bill.identifier`, then `id`, then `log.bill_id`, +/// matching what publishers already use for rendering. +pub fn bill_guid(entry: &Value) -> String { + // Resolve the bill identifier the publishers already render with. 
+ let bill_id = entry + .get("bill") + .and_then(|b| b.get("identifier")) + .and_then(|v| v.as_str()) + .or_else(|| entry.get("id").and_then(|v| v.as_str())) + .or_else(|| { + entry + .get("log") + .and_then(|l| l.get("bill_id")) + .and_then(|v| v.as_str()) + }) + .map(|s| s.trim().to_string()) + .unwrap_or_default(); + + // 1. `sources.log` — the dominant path. Strip `/logs/...` to get the + // enclosing dir, then ensure it ends in `/bills/`. + if let Some(log_path) = entry + .get("sources") + .and_then(|s| s.get("log")) + .and_then(|v| v.as_str()) + { + if let Some(prefix) = log_path.split("/logs/").next() { + if !bill_id.is_empty() { + let needle = format!("/bills/{}", bill_id); + if prefix.ends_with(&needle) { + return prefix.to_string(); + } + return format!("{}{}", prefix, needle); + } + // No bill_id resolved — the prefix is still a stable dedup key + // (collapses all session-level logs to one row, which is a + // reasonable fallback when bill_id is missing). + return prefix.to_string(); + } + } + + // 2. `sources.bill` — the path to `metadata.json`; its parent dir is + // `bills/`. Strip the trailing `/metadata.json`. + if let Some(bill_path) = entry + .get("sources") + .and_then(|s| s.get("bill")) + .and_then(|v| v.as_str()) + { + let trimmed = bill_path + .strip_suffix("/metadata.json") + .unwrap_or(bill_path); + return trimmed.to_string(); + } + + // 3. No sources at all — fall back to the per-log GUID. + extract_guid(entry) +} + /// Convert JSON Lines entries to RSS feed pub fn json_to_rss( entries: Vec, @@ -310,16 +393,22 @@ pub fn json_to_rss( let base_url = base_url.unwrap_or(link); let mut items = Vec::new(); - let mut seen_guids = HashSet::new(); + let mut seen_bills = HashSet::new(); for entry in entries { - let guid = extract_guid(&entry); - - // Deduplicate by GUID - if seen_guids.contains(&guid) { + // Dedup by **bill** — not by action-log path. 
A bill emits one + // record per action-log file (committee referral, hearing, passage + // vote …); RSS readers want one item per bill, not N. The first + // (newest, since the stream is timestamp-sorted DESC) wins; later + // action-log records for the same bill are dropped. The RSS + // `` itself still uses the per-log GUID so a feed reader + // doesn't conflate two genuinely different items across feeds. + let bill_key = bill_guid(&entry); + if seen_bills.contains(&bill_key) { continue; } - seen_guids.insert(guid.clone()); + seen_bills.insert(bill_key); + let guid = extract_guid(&entry); let mut item_builder = ItemBuilder::default(); @@ -497,16 +586,16 @@ pub fn json_to_html( let title_str = title.unwrap_or(""); let mut items_html = String::new(); - let mut seen_guids = HashSet::new(); + let mut seen_bills = HashSet::new(); for entry in entries { - let guid = extract_guid(&entry); - - // Deduplicate by GUID - if seen_guids.contains(&guid) { + // Dedup by **bill** — see `json_to_rss` for the rationale. One HTML + // entry per bill, not one per action log. + let bill_key = bill_guid(&entry); + if seen_bills.contains(&bill_key) { continue; } - seen_guids.insert(guid); + seen_bills.insert(bill_key); let entry_title = extract_title(&entry); let entry_description = extract_description(&entry); @@ -559,7 +648,6 @@ pub fn json_to_html(
- {}
"#, @@ -572,7 +660,6 @@ pub fn json_to_html( .and_then(|t| t.as_str()) .unwrap_or(""), date_html, - if !date_html.is_empty() { "" } else { "" } )); } diff --git a/actions/govbot/src/selectors.rs b/actions/govbot/src/selectors.rs index 4e3de940..730e9387 100644 --- a/actions/govbot/src/selectors.rs +++ b/actions/govbot/src/selectors.rs @@ -1,94 +1,211 @@ /// Default selector for OCDFiles-style JSON structures. -/// Extracts human-readable text content from a JSON value, focusing on bill and log content. +/// +/// Extracts the **full** human-readable text content of a bill from its +/// `metadata.json` projection — per the govbot stream protocol (`STREAM_PROTOCOL.md` +/// §1), the `docs` projection must emit the full bill text, not just titles, so +/// downstream transforms (classification, summarization) see the whole document. +/// +/// For an entry that joins `bill` (the full `metadata.json`) this assembles +/// every text-bearing field of the bill: title, identifier, every abstract, +/// every subject, action descriptions, sponsor names, version notes, related +/// bills, the legislative session and originating organization. For a bare log +/// entry it falls back to the action description. pub fn ocd_files_select_default(value: &serde_json::Value) -> String { + let mut texts = Vec::new(); + collect_bill_text(value, &mut texts); + // Drop empties and de-dup adjacent blanks; join with spaces. + texts + .into_iter() + .filter(|s| !s.trim().is_empty()) + .collect::>() + .join(" ") +} + +/// Extract the OCD `subject:` array — the gold-standard structured topic +/// classification a human OCD scraper assigned to the bill. +/// +/// This is the input the `docs` projection adds as an optional `subjects` +/// field so downstream transforms (e.g. fastclass's `concept_match` matcher) +/// can use the controlled-vocabulary signal directly instead of re-deriving +/// it from the bill text. 
+/// +/// Returns: +/// - `Some(non-empty Vec)` when `metadata.json` has a `subject:` array with +/// at least one non-empty string. +/// - `None` when: +/// - the entry has no bill metadata join (`--join bill` not requested), +/// - the bill metadata has no `subject:` key, +/// - the `subject:` array is empty (`[]`), or +/// - every element is a blank string. +/// +/// **Why empty == None.** Many states populate `subject:` for some bills and +/// leave it `[]` for others; emitting `"subjects": []` would conflate +/// "no signal" with "explicitly no subjects" and force the consumer to +/// distinguish them. Omitting the field entirely is the unambiguous +/// "no signal" form per STREAM_PROTOCOL §1. +pub fn ocd_files_extract_subjects(value: &serde_json::Value) -> Option> { + let bill = bill_object(value)?; + let raw = bill.get("subject")?.as_array()?; + let subjects: Vec = raw + .iter() + .filter_map(|v| v.as_str()) + .map(|s| s.trim().to_string()) + .filter(|s| !s.is_empty()) + .collect(); + if subjects.is_empty() { + None + } else { + Some(subjects) + } +} + +/// Find the bill `metadata.json` object inside an entry, mirroring how +/// `collect_bill_text` routes between the three wrapping shapes: +/// - `{ "bill": { ... } }` — the joined form +/// - `{ "log": { ... } }` — bare log; no bill metadata available +/// - `{ ... }` — the map *is* a bill metadata.json +fn bill_object(value: &serde_json::Value) -> Option<&serde_json::Map> { + let map = value.as_object()?; + if let Some(bill) = map.get("bill").and_then(|v| v.as_object()) { + return Some(bill); + } + if map.contains_key("log") { + // Bare log entry — `subject:` lives on the bill, which isn't joined. + return None; + } + Some(map) +} + +/// Append every text-bearing string of an OCD-files value into `texts`. 
+fn collect_bill_text(value: &serde_json::Value, texts: &mut Vec) { match value { - serde_json::Value::String(s) => s.clone(), + serde_json::Value::String(s) => texts.push(s.clone()), serde_json::Value::Object(map) => { - let mut texts = Vec::new(); - - // Extract from bill object (if present) + // The full bill metadata, when joined under `bill`. if let Some(bill) = map.get("bill") { - if let Some(title) = bill.get("title").and_then(|v| v.as_str()) { - texts.push(title.to_string()); - } - if let Some(subjects) = bill.get("subject") { - texts.push(ocd_files_select_default(subjects)); - } - if let Some(abstracts) = bill.get("abstracts") { - texts.push(ocd_files_select_default(abstracts)); - } - if let Some(session) = bill.get("legislative_session").and_then(|v| v.as_str()) { - texts.push(session.to_string()); - } - if let Some(org) = bill.get("from_organization").and_then(|v| v.as_str()) { - texts.push(org.to_string()); - } + collect_bill_fields(bill, texts); } - - // Extract from log object (if present) + // A bare log object. if let Some(log) = map.get("log") { - if let Some(action) = log.get("action") { - // Extract description from action object - if let Some(desc) = action.get("description").and_then(|v| v.as_str()) { - texts.push(desc.to_string()); - } - // Or if action is directly a string - if let Some(desc_str) = action.as_str() { - texts.push(desc_str.to_string()); - } - } - // Also check for bill_id in log - if let Some(bill_id) = log - .get("bill_id") - .or_else(|| log.get("bill_identifier")) - .and_then(|v| v.as_str()) - { - texts.push(bill_id.to_string()); - } + collect_log_fields(log, texts); + } + // The map *is* a bill metadata.json (no `bill`/`log` wrappers). + if map.get("bill").is_none() && map.get("log").is_none() { + collect_bill_fields(value, texts); + } + } + serde_json::Value::Array(arr) => { + for item in arr { + collect_bill_text(item, texts); + } + } + _ => {} + } +} + +/// Append every text-bearing field of a bill `metadata.json` object. 
+fn collect_bill_fields(bill: &serde_json::Value, texts: &mut Vec) { + let serde_json::Value::Object(map) = bill else { + // Not an object — recurse generically (e.g. when `bill` is a string). + collect_strings(bill, texts); + return; + }; + + push_str(map, "title", texts); + push_str(map, "identifier", texts); + push_str(map, "legislative_session", texts); + push_str(map, "from_organization", texts); + + // Free-text arrays and nested arrays of objects. + if let Some(v) = map.get("abstracts") { + collect_strings(v, texts); + } + if let Some(v) = map.get("subject") { + collect_strings(v, texts); + } + if let Some(v) = map.get("other_titles") { + collect_strings(v, texts); + } + if let Some(v) = map.get("other_identifiers") { + collect_strings(v, texts); + } + + // Action descriptions. + if let Some(actions) = map.get("actions").and_then(|v| v.as_array()) { + for action in actions { + if let Some(desc) = action.get("description").and_then(|v| v.as_str()) { + texts.push(desc.to_string()); } + } + } - // Extract from action object directly (if present at top level, e.g., when processing log object) - if let Some(action) = map.get("action") { - // Extract description from action object - if let Some(desc) = action.get("description").and_then(|v| v.as_str()) { - texts.push(desc.to_string()); - } - // Or if action is directly a string - if let Some(desc_str) = action.as_str() { - texts.push(desc_str.to_string()); - } + // Sponsor names. 
+ if let Some(sponsors) = map.get("sponsorships").and_then(|v| v.as_array()) { + for sponsor in sponsors { + if let Some(name) = sponsor.get("name").and_then(|v| v.as_str()) { + texts.push(name.to_string()); } + } + } - // Fallback: extract from all other text fields (excluding metadata) - for (key, val) in map { - if !key.starts_with("_") - && key != "id" - && key != "sources" - && key != "timestamp" - && key != "bill" - && key != "log" - && key != "title" - && key != "action" - && key != "subjects" - && key != "abstracts" - && key != "legislative_session" - && key != "from_organization" - { - if let Some(text) = val.as_str() { - texts.push(text.to_string()); - } else if val.is_object() || val.is_array() { - texts.push(ocd_files_select_default(val)); - } - } + // Version notes (the closest thing to bill body text in metadata.json). + if let Some(versions) = map.get("versions").and_then(|v| v.as_array()) { + for version in versions { + if let Some(note) = version.get("note").and_then(|v| v.as_str()) { + texts.push(note.to_string()); } + } + } - texts.join(" ") + // Documents notes. + if let Some(docs) = map.get("documents").and_then(|v| v.as_array()) { + for doc in docs { + if let Some(note) = doc.get("note").and_then(|v| v.as_str()) { + texts.push(note.to_string()); + } + } + } +} + +/// Append the text-bearing fields of a log object (action description, bill id). +fn collect_log_fields(log: &serde_json::Value, texts: &mut Vec) { + if let Some(action) = log.get("action") { + if let Some(desc) = action.get("description").and_then(|v| v.as_str()) { + texts.push(desc.to_string()); + } else if let Some(desc) = action.as_str() { + texts.push(desc.to_string()); + } + } + if let Some(bill_id) = log + .get("bill_id") + .or_else(|| log.get("bill_identifier")) + .and_then(|v| v.as_str()) + { + texts.push(bill_id.to_string()); + } +} + +/// Append a single string-valued map field, if present. 
+fn push_str(map: &serde_json::Map<String, serde_json::Value>, key: &str, texts: &mut Vec<String>) { + if let Some(s) = map.get(key).and_then(|v| v.as_str()) { + texts.push(s.to_string()); + } +} + +/// Append every string found anywhere in a JSON value (arrays, nested objects). +fn collect_strings(value: &serde_json::Value, texts: &mut Vec<String>) { + match value { + serde_json::Value::String(s) => texts.push(s.clone()), + serde_json::Value::Array(arr) => { + for item in arr { + collect_strings(item, texts); + } + } + serde_json::Value::Object(map) => { + for v in map.values() { + collect_strings(v, texts); + } } - serde_json::Value::Array(arr) => arr - .iter() - .map(ocd_files_select_default) - .collect::<Vec<_>>() - .join(" "), - _ => String::new(), + _ => {} } } diff --git a/actions/govbot/src/tagfile.rs b/actions/govbot/src/tagfile.rs new file mode 100644 index 00000000..728886aa --- /dev/null +++ b/actions/govbot/src/tagfile.rs @@ -0,0 +1,87 @@ +//! Per-bill tag-file persistence types — the on-disk `.tag.json` format. +//! +//! `govbot apply` (the apply sink of `govbot source --select docs | +//! fastclass classify - | govbot apply`) deserializes a `fastclass classify` +//! result and writes these structs into `/tags/...`; `govbot +//! publish` reads them back as input to the publishers. +//! +//! This module used to be `embeddings.rs` and housed the in-process ONNX +//! embedding pipeline. govbot no longer classifies bills itself — +//! classification is now delegated to `fastclass` over a process boundary +//! (see `schemas/STREAM_PROTOCOL.md`) — so the ONNX machinery has been +//! removed and what remains is just the tag-file shape. Renamed to +//! `tagfile.rs` to match what it actually contains. + +use serde::{Deserialize, Serialize}; +use sha2::{Digest, Sha256}; +use std::collections::HashMap; + +/// Breakdown of scoring components for a tag match.
+#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct ScoreBreakdown { + pub final_score: f64, + pub base_embedding: Option, + pub example_similarity: Option, + /// Keywords from include_keywords that matched in the text. + #[serde(default)] + pub keyword_match: Vec, + pub negative_penalty: f64, +} + +/// A per-tag `.tag.json` file: metadata, an optional text cache, and the +/// bills that matched the tag. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct TagFile { + pub metadata: TagFileMetadata, + pub tag_config: TagDefinition, + #[serde(default)] + pub text_cache: HashMap, + pub bills: HashMap, +} + +/// Metadata about a tag file. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct TagFileMetadata { + pub last_run: String, + pub model: String, + pub tag_config_hash: String, +} + +/// Result for a single bill within a tag file. +#[derive(Debug, Clone, Serialize, Deserialize)] +pub struct BillTagResult { + pub text_hash: String, + pub score: ScoreBreakdown, +} + +/// Hash text (SHA-256 hex) for deduplication / `tag_config` stamping. +pub fn hash_text(text: &str) -> String { + let mut hasher = Sha256::new(); + hasher.update(text.as_bytes()); + format!("{:x}", hasher.finalize()) +} + +/// A stub tag definition stamped into each tag file. The real taxonomy lives +/// in the fastclass classifier bundle, not here — `govbot apply` only records +/// the tag name. +#[derive(Debug, Deserialize, Serialize, Clone)] +pub struct TagDefinition { + pub name: String, + #[serde(default)] + pub description: String, + #[serde(default)] + pub examples: Vec, + #[serde(default)] + pub include_keywords: Vec, + #[serde(default)] + pub exclude_keywords: Vec, + #[serde(default)] + pub negative_examples: Vec, + /// Minimum similarity score (0.0 - 1.0). Defaults to 0.5. 
+ #[serde(default = "default_threshold")] + pub threshold: f32, +} + +fn default_threshold() -> f32 { + 0.5 +} diff --git a/actions/govbot/src/wizard.rs b/actions/govbot/src/wizard.rs index 9368fb37..4d61b93a 100644 --- a/actions/govbot/src/wizard.rs +++ b/actions/govbot/src/wizard.rs @@ -6,8 +6,9 @@ use std::path::Path; /// Represents the user's choices during the wizard. /// Used both by the interactive wizard and by tests to simulate different paths. pub struct WizardChoices { - pub repos: Vec, - pub include_example_tag: bool, + /// The datasets the project consumes (`govbot.yml: datasets:`). + pub datasets: Vec, + /// Base URL for the RSS/HTML publisher. pub base_url: String, } @@ -31,61 +32,45 @@ impl WizardSession { // Welcome display.push_str("Welcome to govbot! Let's set up your project.\n\n"); - // Step 1: Sources - display.push_str("? What data sources do you want to track?\n"); - if choices.repos == ["all"] { - display.push_str("> All states (47 jurisdictions)\n"); - display.push_str(" Select specific states\n"); + // Step 1: Datasets + display.push_str("? What datasets do you want to track?\n"); + if choices.datasets == ["all"] { + display.push_str("> All jurisdictions in the registry\n"); + display.push_str(" Select specific datasets\n"); } else { - display.push_str(" All states (47 jurisdictions)\n"); - display.push_str("> Select specific states\n"); + display.push_str(" All jurisdictions in the registry\n"); + display.push_str("> Select specific datasets\n"); display.push('\n'); - display.push_str("Available states/jurisdictions:\n"); - let all_locales = crate::locale::WorkingLocale::all(); - let locale_strs: Vec = all_locales.iter().map(|l| l.as_str().to_string()).collect(); - for chunk in locale_strs.chunks(10) { - display.push_str(&format!(" {}\n", chunk.join(", "))); - } + display.push_str("Browse the registry with `govbot search`.\n"); display.push('\n'); - display.push_str(&format!("? 
Enter state codes separated by spaces: {}\n", choices.repos.join(" "))); + display.push_str(&format!( + "? Enter dataset ids separated by spaces: {}\n", + choices.datasets.join(" ") + )); } display.push('\n'); - // Step 2: Tags - display.push_str("Tags let govbot categorize legislation by topics you care about.\n"); - display.push_str("Here's an example tag definition:\n\n"); - display.push_str(" education:\n"); - display.push_str(" description: |\n"); - display.push_str(" Legislation related to schools, education funding,\n"); - display.push_str(" curriculum standards, and educational policy.\n"); - display.push_str(" examples:\n"); - display.push_str(" - \"Increases per-pupil funding for public schools\"\n"); - display.push_str(" - \"Mandates comprehensive sex education curriculum\"\n\n"); - - display.push_str("? How would you like to set up tags?\n"); - if choices.include_example_tag { - display.push_str("> Use the example \"education\" tag to start\n"); - display.push_str(" I'll create my own tags later\n"); - } else { - display.push_str(" Use the example \"education\" tag to start\n"); - display.push_str("> I'll create my own tags later\n"); - display.push('\n'); - display.push_str(&ai_prompt_template()); - } - display.push('\n'); + // Step 2: Classification (a separate fastclass bundle, not govbot.yml) + display.push_str("Classification is done by fastclass against a classifier bundle.\n"); + display.push_str("Point the manifest's `transforms.classify.classifier` at your\n"); + display.push_str("bundle directory (containing classifier.yml). See the fastclass\n"); + display.push_str("docs to build one.\n\n"); // Step 3: Publishing - display.push_str("Publishing is configured for RSS feeds by default.\n"); - display.push_str("Your feeds will be generated in the \"docs\" directory.\n\n"); - display.push_str(&format!("? 
Base URL for your feeds: {}\n\n", choices.base_url)); + display.push_str("Publishing is configured for an RSS feed + HTML index by default.\n"); + display.push_str("Both land in the \"docs\" directory (feed.xml + index.html).\n\n"); + display.push_str(&format!( + "? Base URL for your feeds: {}\n\n", + choices.base_url + )); // Summary display.push_str(" ✓ Created govbot.yml\n"); - display.push_str(" ✓ Created .gitignore with .govbot\n"); + display.push_str(" ✓ Created .gitignore\n"); display.push_str(" ✓ Created .github/workflows/build.yml\n\n"); display.push_str("Setup complete! Run 'govbot' again to start the pipeline.\n"); - let govbot_yml = generate_govbot_yml(&choices.repos, choices.include_example_tag, &choices.base_url); + let govbot_yml = generate_govbot_yml(&choices.datasets, &choices.base_url); let workflow_yml = github_workflow_content().to_string(); WizardSession { @@ -125,37 +110,11 @@ impl WizardSession { } } -/// The AI prompt template shown when users choose to create their own tags. 
-pub fn ai_prompt_template() -> String { - let mut s = String::new(); - s.push_str("To create a tag, copy this prompt into your preferred AI tool:\n\n"); - s.push_str("---\n"); - s.push_str("Create a govbot tag definition in YAML for tracking [YOUR TOPIC] legislation.\n"); - s.push_str("The tag should have:\n"); - s.push_str("- A description (multiline, covering subtopics)\n"); - s.push_str("- 2-3 example bill descriptions that would match\n"); - s.push_str("- Optional: include_keywords and exclude_keywords lists\n\n"); - s.push_str("Format:\n"); - s.push_str(" tag_name:\n"); - s.push_str(" description: |\n"); - s.push_str(" ...\n"); - s.push_str(" examples:\n"); - s.push_str(" - \"...\"\n"); - s.push_str(" include_keywords:\n"); - s.push_str(" - keyword1\n"); - s.push_str(" exclude_keywords:\n"); - s.push_str(" - keyword1\n"); - s.push_str("---\n\n"); - s.push_str("Paste the result into your govbot.yml under the 'tags:' section.\n"); - s -} - /// Generate default govbot.yml and supporting files without interactive prompts. /// Used when `govbot init` is run in a non-interactive terminal. pub fn write_default_files(dir: &Path) -> Result<()> { let choices = WizardChoices { - repos: vec!["all".to_string()], - include_example_tag: true, + datasets: vec!["all".to_string()], base_url: "https://example.com".to_string(), }; let session = WizardSession::from_choices(&choices); @@ -181,22 +140,21 @@ pub fn run_wizard() -> Result<()> { eprintln!("Welcome to govbot! Let's set up your project."); eprintln!(); - // Step 1: Sources - let repos = prompt_sources()?; + // Step 1: Datasets + let datasets = prompt_sources()?; - // Step 2: Tags - let include_example_tag = prompt_tags()?; + // Step 2: Classification — handled by a separate fastclass bundle. 
+ eprintln!(); + eprintln!("Classification is done by fastclass against a classifier bundle."); + eprintln!("Point the manifest's `transforms.classify.classifier` at your"); + eprintln!("bundle directory (containing classifier.yml)."); // Step 3: Publishing info let base_url = prompt_publishing()?; // Generate and write files let cwd = std::env::current_dir()?; - let choices = WizardChoices { - repos, - include_example_tag, - base_url, - }; + let choices = WizardChoices { datasets, base_url }; let session = WizardSession::from_choices(&choices); session.write_files(&cwd)?; @@ -209,8 +167,8 @@ pub fn run_wizard() -> Result<()> { fn prompt_sources() -> Result> { let options = vec![ - "All states (47 jurisdictions)", - "Select specific states", + "All jurisdictions in the registry", + "Select specific datasets", ]; let selection = Select::new() @@ -223,19 +181,26 @@ fn prompt_sources() -> Result> { return Ok(vec!["all".to_string()]); } - // Show available states and let user type them - let all_locales = crate::locale::WorkingLocale::all(); - let locale_strs: Vec = all_locales.iter().map(|l| l.as_str().to_string()).collect(); - - eprintln!(); - eprintln!("Available states/jurisdictions:"); - for chunk in locale_strs.chunks(10) { - eprintln!(" {}", chunk.join(", ")); + // List the registry's datasets so the user can pick from them. 
+ let cwd = std::env::current_dir().unwrap_or_else(|_| std::path::PathBuf::from(".")); + if let Ok(registry) = crate::registry::Registry::load(&cwd) { + let ids: Vec<String> = registry + .all() + .iter() + .map(|d| d.short_name().to_string()) + .collect(); + eprintln!(); + eprintln!("Available datasets ({}):", ids.len()); + for chunk in ids.chunks(10) { + eprintln!(" {}", chunk.join(", ")); + } + eprintln!(); + eprintln!("Tip: `govbot search ` searches the registry."); + eprintln!(); } - eprintln!(); let input: String = Input::new() - .with_prompt("Enter state codes separated by spaces (e.g., il ca ny)") + .with_prompt("Enter dataset ids separated by spaces (e.g., il ca ny)") .interact_text()?; let repos: Vec<String> = input @@ -251,46 +216,10 @@ fn prompt_sources() -> Result<Vec<String>> { } } -fn prompt_tags() -> Result<bool> { - eprintln!(); - eprintln!("Tags let govbot categorize legislation by topics you care about."); - eprintln!("Here's an example tag definition:"); - eprintln!(); - eprintln!(" education:"); - eprintln!(" description: |"); - eprintln!(" Legislation related to schools, education funding,"); - eprintln!(" curriculum standards, and educational policy."); - eprintln!(" examples:"); - eprintln!(" - \"Increases per-pupil funding for public schools\""); - eprintln!(" - \"Mandates comprehensive sex education curriculum\""); - eprintln!(); - - let options = vec![ - "Use the example \"education\" tag to start", - "I'll create my own tags later", - ]; - - let selection = Select::new() - .with_prompt("How would you like to set up tags?") - .items(&options) - .default(0) - .interact()?; - - if selection == 1 { - let template = ai_prompt_template(); - for line in template.lines() { - eprintln!("{}", line); - } - eprintln!(); - } - - Ok(selection == 0) -} - fn prompt_publishing() -> Result<String> { eprintln!(); - eprintln!("Publishing is configured for RSS feeds by default."); - eprintln!("Your feeds will be generated in the \"docs\" directory."); + eprintln!("Publishing is configured for an RSS 
feed + HTML index by default."); + eprintln!("Both land in the \"docs\" directory (feed.xml + index.html)."); eprintln!(); let base_url: String = Input::new() @@ -301,84 +230,137 @@ fn prompt_publishing() -> Result { Ok(base_url) } -/// Generate govbot.yml content from wizard answers. +/// Generate a `govbot.yml` manifest from wizard answers. +/// +/// The manifest declares `datasets` + `transforms` + `publish` + `pipelines` — +/// it is NOT a classifier. The tag taxonomy lives in a separate fastclass +/// classifier bundle that `transforms.classify.classifier` references by path. /// This is a pure function for easy testing. -pub fn generate_govbot_yml(repos: &[String], include_example_tag: bool, base_url: &str) -> String { +pub fn generate_govbot_yml(datasets: &[String], base_url: &str) -> String { let mut yml = String::new(); - yml.push_str("# Govbot Configuration\n"); + yml.push_str("# Govbot Manifest\n"); yml.push_str("# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json\n"); yml.push_str("$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json\n\n"); - // Repos section - yml.push_str("repos:\n"); - for repo in repos { - yml.push_str(&format!(" - {}\n", repo)); + // datasets — the government-data sources this project consumes. 
+ yml.push_str("datasets:\n"); + for dataset in datasets { + yml.push_str(&format!(" - {}\n", dataset)); } yml.push('\n'); - // Tags section - yml.push_str("tags:\n"); - if include_example_tag { - yml.push_str(" education:\n"); - yml.push_str(" description: |\n"); - yml.push_str(" Legislation related to schools, education funding, curriculum standards, and educational policy, including:\n"); - yml.push_str(" - K-12 public school funding, budgets, and resource allocation\n"); - yml.push_str(" - Curriculum standards, content requirements, and academic programs\n"); - yml.push_str(" - Teacher certification, training, professional development, and compensation\n"); - yml.push_str(" - Higher education policy, tuition, financial aid, and student loans\n"); - yml.push_str(" - Charter schools, school choice, vouchers, and alternative education models\n"); - yml.push_str(" - Special education services, accommodations, and individualized education plans\n"); - yml.push_str(" - School safety, security measures, and student discipline policies\n"); - yml.push_str(" - Early childhood education, pre-K programs, and childcare\n"); - yml.push_str(" - Standardized testing, assessments, and accountability measures\n"); - yml.push_str(" - School district governance, administration, and oversight\n"); - yml.push_str(" - Educational technology, digital learning, and online education\n"); - yml.push_str(" - Career and technical education, vocational training, and workforce development\n"); - yml.push_str(" examples:\n"); - yml.push_str(" - \"Increases per-pupil funding for public schools and establishes minimum teacher salary requirements\"\n"); - yml.push_str(" - \"Mandates comprehensive sex education curriculum in all public schools\"\n"); - yml.push_str(" - \"Expands eligibility for state financial aid programs to include part-time students\"\n"); - } else { - yml.push_str(" # Add your tags here. 
Example:\n"); - yml.push_str(" # my_topic:\n"); - yml.push_str(" # description: |\n"); - yml.push_str(" # Legislation related to ...\n"); - yml.push_str(" # examples:\n"); - yml.push_str(" # - \"Example bill description\"\n"); - yml.push_str(" {}\n"); - } + // transforms — external processes speaking the govbot stream protocol. + // The classify transform shells out to fastclass; point `classifier:` at + // your fastclass classifier bundle directory (containing classifier.yml). + yml.push_str("transforms:\n"); + yml.push_str(" classify:\n"); + yml.push_str(" command: [fastclass, classify, \"-\"]\n"); + yml.push_str(" reads: docs\n"); + yml.push_str(" writes: classification\n"); + yml.push_str(" # Path to your fastclass classifier bundle (containing classifier.yml).\n"); + yml.push_str(" classifier: ./classifier\n"); + yml.push('\n'); + + // publish — one publisher type, one artifact. + // - `feed` (type: rss) writes /feed.xml + // - `site` (type: html) writes /index.html + yml.push_str("publish:\n"); + yml.push_str(" feed:\n"); + yml.push_str(" type: rss\n"); + yml.push_str(&format!(" base_url: \"{}\"\n", base_url)); + yml.push_str(" output_dir: \"docs\"\n"); + yml.push_str(" output_file: \"feed.xml\"\n"); + yml.push_str(" site:\n"); + yml.push_str(" type: html\n"); + yml.push_str(&format!(" base_url: \"{}\"\n", base_url)); + yml.push_str(" output_dir: \"docs\"\n"); + yml.push_str(" output_file: \"index.html\"\n"); yml.push('\n'); - // Build section - yml.push_str("build:\n"); - yml.push_str(&format!(" base_url: \"{}\"\n", base_url)); - yml.push_str(" output_dir: \"docs\"\n"); - yml.push_str(" output_file: \"feed.xml\"\n"); + // pipelines — named `govbot run` targets, npm-script style. + yml.push_str("pipelines:\n"); + yml.push_str(" default:\n"); + yml.push_str(" - classify\n"); + yml.push_str(" - feed\n"); + yml.push_str(" - site\n"); yml } -/// Write .gitignore with .govbot entry +/// Write .gitignore with govbot's generated dirs and secret-bearing files. 
+/// +/// Everything under `.govbot/` (cloned datasets, sync state — the cache), +/// every publisher output dir (`dist/`, `docs/`), the classification-output +/// dir `tags/`, the operational-state dir `state/`, and any local `.env` is +/// untracked. The userland repo is a few dozen text files plus tool +/// artifacts; the artifacts never belong in git. +/// +/// **`tags/` trade-off.** `govbot apply` writes per-tag `.tag.json` files +/// under `tags//country:.../sessions//`. The file count grows +/// with the catalog and most bots regenerate from raw data on every run — +/// so it is git-ignored by default. Users who want classification +/// provenance committed (e.g. for offline review or auditability) can +/// remove the `tags/` line from this file. +/// +/// **`state/` trade-off.** The bluesky publisher writes its posted-state +/// ledger under `state/bluesky-.ledger` — the append-only record of +/// which bills have already been posted. Ignored by default to keep the +/// repo clean; remove the `state/` line to commit the post history and +/// let a cold clone (e.g. a fresh CI runner) resume without double-posts. +/// Same regenerable-but-operational shape as `tags/`. pub fn write_gitignore(cwd: &Path) -> Result<()> { let gitignore_path = cwd.join(".gitignore"); - let gitignore_entry = ".govbot\n"; + // Single canonical block — easy to grep, easy to update. + let block = "\ +# govbot — generated, reconstructed on every run +.govbot/ +dist/ +docs/ +# Classification output from `govbot apply` — regenerated each run. +# Remove this line if you want classification provenance committed. +tags/ +# Publisher state — append-only ledgers (e.g. bluesky's posted-state). +# Regenerable-but-operational: deleting it makes the next run double-post. +# Remove this line to commit post history and let cold clones resume cleanly. +state/ + +# Secrets — never commit +.env +"; + + // Idempotency: only append entries that are not already present. 
+ let existing = if gitignore_path.exists() { + fs::read_to_string(&gitignore_path)? + } else { + String::new() + }; - if gitignore_path.exists() { - let mut content = fs::read_to_string(&gitignore_path)?; - if content.contains(".govbot") { - eprintln!(" ✓ .gitignore already contains .govbot"); - } else { - if !content.ends_with('\n') { - content.push('\n'); + let mut updated = existing.clone(); + let mut added = Vec::new(); + for line in block.lines() { + let trimmed = line.trim(); + if trimmed.is_empty() || trimmed.starts_with('#') { + continue; + } + if !existing.lines().any(|l| l.trim() == trimmed) { + if !updated.is_empty() && !updated.ends_with('\n') { + updated.push('\n'); } - content.push_str(gitignore_entry); - fs::write(&gitignore_path, content)?; - eprintln!(" ✓ Updated .gitignore to include .govbot"); + updated.push_str(line); + updated.push('\n'); + added.push(trimmed.to_string()); } + } + + if existing.is_empty() { + fs::write(&gitignore_path, block)?; + eprintln!(" ✓ Created .gitignore"); + } else if !added.is_empty() { + fs::write(&gitignore_path, &updated)?; + eprintln!(" ✓ Updated .gitignore ({} entries added)", added.len()); } else { - fs::write(&gitignore_path, gitignore_entry)?; - eprintln!(" ✓ Created .gitignore with .govbot"); + eprintln!(" ✓ .gitignore already covers govbot's generated dirs"); } Ok(()) @@ -386,7 +368,7 @@ pub fn write_gitignore(cwd: &Path) -> Result<()> { fn github_workflow_content() -> &'static str { r#"# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. +# Runs govbot to pull datasets, apply classifications, and publish feeds. 
name: Build Govbot @@ -399,12 +381,8 @@ on: - cron: '0 0 * * *' workflow_dispatch: inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' + description: 'Limit number of entries per artifact (default: 100, use "none" for all)' required: false type: string @@ -419,7 +397,6 @@ jobs: - name: Run Govbot uses: chihacknight/govbot/actions/govbot@main with: - tags: ${{ inputs.tags }} limit: ${{ inputs.limit }} "# } diff --git a/actions/govbot/tapes/govbot-clone-list.tape b/actions/govbot/tapes/govbot-pull-list.tape similarity index 50% rename from actions/govbot/tapes/govbot-clone-list.tape rename to actions/govbot/tapes/govbot-pull-list.tape index 393ece0f..e925b0bc 100644 --- a/actions/govbot/tapes/govbot-clone-list.tape +++ b/actions/govbot/tapes/govbot-pull-list.tape @@ -1,5 +1,5 @@ -# govbot clone --list output -Output tapes/govbot-clone-list.gif +# govbot pull --list output +Output tapes/govbot-pull-list.gif Set Shell bash Set FontSize 14 @@ -7,6 +7,6 @@ Set Width 900 Set Height 800 Set Padding 20 -Type "govbot clone --list" +Type "govbot pull --list" Enter Sleep 2s diff --git a/actions/govbot/tapes/logs-basic.tape b/actions/govbot/tapes/logs-basic.tape deleted file mode 100644 index ce824cfb..00000000 --- a/actions/govbot/tapes/logs-basic.tape +++ /dev/null @@ -1,12 +0,0 @@ -# govbot logs output with mock data -Output tapes/logs-basic.gif - -Set Shell bash -Set FontSize 12 -Set Width 1200 -Set Height 600 -Set Padding 20 - -Type "govbot logs --govbot-dir mocks/.govbot" -Enter -Sleep 3s diff --git a/actions/govbot/tapes/source-basic.tape b/actions/govbot/tapes/source-basic.tape new file mode 100644 index 00000000..7cb1da99 --- /dev/null +++ b/actions/govbot/tapes/source-basic.tape @@ -0,0 +1,12 @@ +# govbot source output with mock data +Output tapes/source-basic.gif + +Set Shell bash +Set FontSize 12 
+Set Width 1200 +Set Height 600 +Set Padding 20 + +Type "govbot source --govbot-dir mocks/.govbot" +Enter +Sleep 3s diff --git a/actions/govbot/tests/api_snaps.rs b/actions/govbot/tests/api_snaps.rs index 245b302e..536c98d7 100644 --- a/actions/govbot/tests/api_snaps.rs +++ b/actions/govbot/tests/api_snaps.rs @@ -1,10 +1,10 @@ -use govbot::prelude::*; use futures::StreamExt; +use govbot::prelude::*; use insta; /// Snapshot test for the pipeline processor -/// +/// /// This test processes log files and compares the output against stored snapshots. /// To update snapshots after making changes, run: /// cargo insta review @@ -12,7 +12,7 @@ use insta; async fn test_pipeline_processor_snapshot() { // Use the same test data directory as the example let git_dir = "tmp/git/repos"; - + // Build configuration matching the render-snapshots.sh script let config = ConfigBuilder::new(git_dir) .sort_order_str("DESC") @@ -21,7 +21,7 @@ async fn test_pipeline_processor_snapshot() { .join_options_str("bill") .unwrap() .build(); - + // Skip test if git_dir doesn't exist (e.g., in CI without test data) let config = match config { Ok(c) => c, @@ -37,7 +37,7 @@ async fn test_pipeline_processor_snapshot() { // Collect all entries from the stream let mut stream = processor.process(); let mut entries = Vec::new(); - + while let Some(result) = stream.next().await { match result { Ok(entry) => entries.push(entry), @@ -49,8 +49,8 @@ async fn test_pipeline_processor_snapshot() { } // Serialize to JSON for snapshot comparison - let json_output = serde_json::to_string_pretty(&entries) - .expect("Failed to serialize entries to JSON"); + let json_output = + serde_json::to_string_pretty(&entries).expect("Failed to serialize entries to JSON"); // Use insta's assert_snapshot! 
macro for string comparison // The snapshot will be stored in tests/snapshots/api_snapshot_tests__test_pipeline_processor_snapshot.snap @@ -88,4 +88,3 @@ async fn test_vote_event_processing() { // Test vote event result serialization insta::assert_json_snapshot!("vote_event_results", &results); } - diff --git a/actions/govbot/tests/bills_jsonl_compat.rs b/actions/govbot/tests/bills_jsonl_compat.rs new file mode 100644 index 00000000..5fc0c4b8 --- /dev/null +++ b/actions/govbot/tests/bills_jsonl_compat.rs @@ -0,0 +1,328 @@ +//! Back-compat contract test for `govbot logs` and the `bills.jsonl` shape. +//! +//! The chihacknight/govbot upgrade replaces the legacy `govbot logs` command +//! with `govbot source`. The most important downstream consumer is Frankie +//! Vegliante's `CHN-Bluesky-Govbot-Main` framework, which runs ~13 civic- +//! issue Bluesky bots, each driven by `govbot logs > bills.jsonl` in cron. +//! Its `scripts/post_to_bluesky.py` parser reads each line as a JSON object +//! and accesses a specific set of field paths. Breaking any of them silently +//! breaks every bot's next cron run. +//! +//! This test pins: +//! +//! 1. `govbot logs` runs (the back-compat alias survives). +//! 2. stdout is valid JSON-Lines. +//! 3. Every field path Frankie's parser accesses is present on at least +//! one record: +//! - `record.id` +//! - `record.timestamp` +//! - `record.bill.identifier` +//! - `record.bill.title` +//! - `record.bill.legislative_session` +//! - `record.bill.abstracts[].abstract` (when any abstract is present) +//! - `record.bill.subject` (when any subject is present) +//! - `record.log.action.description` +//! - `record.log.action.date` +//! - `record.sources` (nested values contain `state:`) +//! 4. State detection works — `\bstate:([a-z]{2})\b` matches somewhere in +//! the record on at least one line (Frankie's state-extraction regex). +//! 5. The dedup_key Frankie composes +//! 
(`f"{state}|{identifier}|{action_date}|{action_desc[:40]}"`) is +//! non-empty and stable across two consecutive invocations against the +//! same mock corpus. +//! +//! Anyone who changes the shape `govbot source` emits (which `govbot logs` +//! aliases) gets a red test here before Frankie's bots see a broken cron. + +use std::path::PathBuf; +use std::process::Command; + +use regex::Regex; +use serde_json::Value; + +/// Path to the freshly-built `govbot` binary. Mirrors the helper in +/// `run_repos_scope.rs` to keep this test binary self-contained. +fn govbot_binary() -> PathBuf { + let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + let status = Command::new("cargo") + .args(["build", "--bin", "govbot"]) + .current_dir(&manifest_dir) + .status() + .expect("cargo build should succeed"); + assert!(status.success(), "cargo build failed"); + manifest_dir.join("target").join("debug").join("govbot") +} + +/// Path to the in-tree mock corpus — the same fixture `just govbot source` +/// uses for dev runs (`actions/govbot/mocks/.govbot`). +fn mocks_govbot_dir() -> PathBuf { + PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("mocks") + .join(".govbot") +} + +/// Run `govbot logs --govbot-dir ` and return (stdout, stderr). +fn run_logs(govbot_dir: &std::path::Path) -> (String, String) { + let bin = govbot_binary(); + let output = Command::new(&bin) + .arg("logs") + .arg("--govbot-dir") + .arg(govbot_dir) + .output() + .expect("spawn govbot logs"); + assert!( + output.status.success(), + "govbot logs exited with {:?}\nstderr:\n{}", + output.status.code(), + String::from_utf8_lossy(&output.stderr) + ); + ( + String::from_utf8(output.stdout).expect("stdout utf8"), + String::from_utf8(output.stderr).expect("stderr utf8"), + ) +} + +/// Parse JSON-Lines stdout into a `Vec`, skipping blank lines. 
+fn parse_jsonl(stdout: &str) -> Vec<Value> { + stdout + .lines() + .filter(|l| !l.trim().is_empty()) + .map(|l| { + serde_json::from_str::<Value>(l) + .unwrap_or_else(|e| panic!("invalid JSON line: {e}: {l}")) + }) + .collect() +} + +/// Build the dedup key Frankie's `scripts/post_to_bluesky.py` composes: +/// `f"{state}|{identifier}|{action_date}|{action_desc[:40]}"`. +fn dedup_key(record: &Value) -> Option<String> { + let state = state_from_record(record)?; + let identifier = record + .get("bill") + .and_then(|b| b.get("identifier")) + .and_then(|v| v.as_str())?; + let action_date = record + .get("log") + .and_then(|l| l.get("action")) + .and_then(|a| a.get("date")) + .and_then(|v| v.as_str())?; + let action_desc = record + .get("log") + .and_then(|l| l.get("action")) + .and_then(|a| a.get("description")) + .and_then(|v| v.as_str())?; + let desc_head: String = action_desc.chars().take(40).collect(); + Some(format!("{state}|{identifier}|{action_date}|{desc_head}")) +} + +/// Frankie's state-detection regex: `\bstate:([a-z]{2})\b` searched against +/// the JSON-encoded record (his code walks `record["sources"]` and any +/// nested strings; serializing the whole record is the same surface). +fn state_from_record(record: &Value) -> Option<String> { + let re = Regex::new(r"\bstate:([a-z]{2})\b").expect("regex compiles"); + let flat = serde_json::to_string(record).ok()?; + re.captures(&flat).map(|c| c[1].to_string()) +} + +/// (1) `govbot logs` survives the rename, (2) stdout is JSON-Lines, and (3) +/// every field path Frankie's parser touches is present on at least one +/// record. Coverage is "at least one record" — Frankie's parser walks the +/// stream and defends individual missing fields with `.get(...)` defaults; +/// the contract is that the SHAPE exists when the data does.
+#[test] +fn govbot_logs_emits_every_field_frankie_reads() { + let govbot_dir = mocks_govbot_dir(); + assert!( + govbot_dir.exists(), + "mock corpus missing at {}; run from actions/govbot/", + govbot_dir.display() + ); + + let (stdout, stderr) = run_logs(&govbot_dir); + let records = parse_jsonl(&stdout); + assert!( + !records.is_empty(), + "govbot logs against the mock corpus emitted zero records — \ + the alias is wired up but Source produced no output. stderr:\n{stderr}" + ); + + // Top-level required-on-every-record fields. + for (i, r) in records.iter().enumerate() { + assert!( + r.get("id").and_then(|v| v.as_str()).is_some(), + "record[{i}] missing `id`: {r}" + ); + assert!( + r.get("timestamp").and_then(|v| v.as_str()).is_some(), + "record[{i}] missing `timestamp`: {r}" + ); + assert!( + r.get("bill").and_then(|v| v.as_object()).is_some(), + "record[{i}] missing `bill` object: {r}" + ); + assert!( + r.get("log").and_then(|v| v.as_object()).is_some(), + "record[{i}] missing `log` object: {r}" + ); + assert!( + r.get("sources").is_some(), + "record[{i}] missing `sources`: {r}" + ); + } + + // Required bill subfields present on every record (mock corpus does + // emit these for every bill). + for (i, r) in records.iter().enumerate() { + let bill = &r["bill"]; + assert!( + bill.get("identifier").and_then(|v| v.as_str()).is_some(), + "record[{i}].bill missing `identifier`: {bill}" + ); + assert!( + bill.get("title").and_then(|v| v.as_str()).is_some(), + "record[{i}].bill missing `title`: {bill}" + ); + assert!( + bill.get("legislative_session") + .and_then(|v| v.as_str()) + .is_some(), + "record[{i}].bill missing `legislative_session`: {bill}" + ); + } + + // Required log.action subfields on every record. 
+ for (i, r) in records.iter().enumerate() { + let action = r["log"] + .get("action") + .expect(&format!("record[{i}].log.action missing")); + assert!( + action.get("description").and_then(|v| v.as_str()).is_some(), + "record[{i}].log.action missing `description`: {action}" + ); + assert!( + action.get("date").and_then(|v| v.as_str()).is_some(), + "record[{i}].log.action missing `date`: {action}" + ); + } + + // `bill.abstracts[].abstract` — must be present on at least one + // record when the underlying corpus has any abstract. The wy mock + // has bills with `abstracts: [{abstract:..., note:"summary"}]`. + let any_with_abstract = records.iter().any(|r| { + r["bill"] + .get("abstracts") + .and_then(|a| a.as_array()) + .map(|arr| { + arr.iter() + .any(|obj| obj.get("abstract").and_then(|v| v.as_str()).is_some()) + }) + .unwrap_or(false) + }); + assert!( + any_with_abstract, + "no record exposed `bill.abstracts[].abstract` — Frankie's \ + abstract-text fallback path will never trigger. The wy mock \ + is known to carry abstracts; if this fails the source-side \ + abstracts projection has regressed." + ); + + // `bill.subject` — Frankie's parser does `record["bill"].get("subject", [])`, + // so an absent key is tolerated (interpreted as no subjects). The hard + // contract is the inverse: if `subject` IS present, it must be an array + // of strings — anything else (object, scalar) would break his loop. + // (Non-empty subjects are projected from the mocks when present; the + // wy/gu mocks happen to ship `subject:[]` which Source omits by design, + // pinned by `ocd_entry_to_doc_omits_subjects_when_subject_array_is_empty` + // in main.rs.) 
+ for (i, r) in records.iter().enumerate() { + if let Some(subj) = r["bill"].get("subject") { + assert!( + subj.is_array(), + "record[{i}].bill.subject is not an array (type breaks \ + Frankie's parser): {subj}" + ); + for (j, s) in subj.as_array().unwrap().iter().enumerate() { + assert!( + s.is_string(), + "record[{i}].bill.subject[{j}] is not a string: {s}" + ); + } + } + } + + // `record.sources` nested strings must contain `state:` somewhere + // — this is the regex anchor Frankie's parser uses to attribute a + // record to a US state for the dedup_key and per-bot routing. + let re = Regex::new(r"\bstate:([a-z]{2})\b").unwrap(); + let any_with_state = records.iter().any(|r| { + r.get("sources") + .map(|s| serde_json::to_string(s).unwrap_or_default()) + .map(|flat| re.is_match(&flat)) + .unwrap_or(false) + }); + assert!( + any_with_state, + "no record's `sources` contained a `state:` substring; \ + Frankie's state-extraction regex will fail on every bot." + ); + + // Belt-and-suspenders: at least one record matches the regex when + // serialized whole (sources or anywhere) — the broader form Frankie's + // parser actually walks. + let any_state_anywhere = records.iter().any(|r| state_from_record(r).is_some()); + assert!( + any_state_anywhere, + "no record yielded a state from the `\\bstate:([a-z]{{2}})\\b` \ + regex; Frankie's state attribution is dead." + ); + + // Deprecation warning lands on stderr (and ONLY stderr — stdout was + // parsed as JSON-Lines above; any leakage would have failed the + // `serde_json::from_str` line-by-line above). + assert!( + stderr.contains("`govbot logs` is deprecated"), + "stderr did not carry the deprecation warning; got:\n{stderr}" + ); +} + +/// (5) Dedup keys are non-empty and stable across two consecutive +/// invocations on the same mock data. Frankie's bots persist this key in +/// a posted-state ledger; instability would re-post every bill on every +/// cron run. 
+#[test] +fn dedup_key_is_nonempty_and_stable_across_runs() { + let govbot_dir = mocks_govbot_dir(); + + let (stdout_a, _) = run_logs(&govbot_dir); + let (stdout_b, _) = run_logs(&govbot_dir); + + let records_a = parse_jsonl(&stdout_a); + let records_b = parse_jsonl(&stdout_b); + + let keys_a: Vec<String> = records_a.iter().filter_map(dedup_key).collect(); + let keys_b: Vec<String> = records_b.iter().filter_map(dedup_key).collect(); + + assert!( + !keys_a.is_empty(), + "first invocation produced zero non-empty dedup keys; Frankie's \ + ledger would be empty and every bill would re-post forever." + ); + for k in &keys_a { + let parts: Vec<&str> = k.split('|').collect(); + assert_eq!( + parts.len(), + 4, + "dedup_key not of shape state|identifier|date|desc[:40]: {k}" + ); + for (i, p) in parts.iter().enumerate() { + assert!(!p.is_empty(), "dedup_key part {i} is empty in: {k}"); + } + } + + assert_eq!( + keys_a, keys_b, + "dedup keys diverged across two consecutive runs on the same mock \ + corpus — Frankie's bots would re-post every bill on every cron run." + ); +} diff --git a/actions/govbot/tests/cli_example_snaps.rs b/actions/govbot/tests/cli_example_snaps.rs index 3f902337..bbe959c6 100644 --- a/actions/govbot/tests/cli_example_snaps.rs +++ b/actions/govbot/tests/cli_example_snaps.rs @@ -119,14 +119,12 @@ fn run_example_script(script_path: &Path) -> (String, String, i32) { let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); let govbot_dir = manifest_dir.join("mocks").join(".govbot"); - // Set URL template to match existing mock data (uses -data-pipeline suffix) - let repo_url_template = "https://github.com/chn-openstates-files/{locale}-data-pipeline.git"; - + // The mock data's clone directories use the default `-legislation` suffix, + // so no `GOVBOT_REPO_SUFFIX` override is needed.
let output = Command::new(&binary) .args(&args) .current_dir(&manifest_dir) .env("GOVBOT_DIR", govbot_dir.to_string_lossy().as_ref()) - .env("GOVBOT_REPO_URL_TEMPLATE", repo_url_template) .output() .expect("Failed to execute command"); @@ -179,8 +177,8 @@ fn format_snapshot_with_script(script_path: &Path, output: &str) -> String { /// Check if a script requires test data to run fn script_requires_test_data(script_path: &Path) -> bool { if let Ok(content) = fs::read_to_string(script_path) { - // Commands that need test data (repos directory) - content.contains("govbot logs") + // Commands that need test data (datasets directory) + content.contains("govbot source") } else { false } diff --git a/actions/govbot/tests/fixtures/frankie_transportation_config.yml b/actions/govbot/tests/fixtures/frankie_transportation_config.yml new file mode 100644 index 00000000..a9238a66 --- /dev/null +++ b/actions/govbot/tests/fixtures/frankie_transportation_config.yml @@ -0,0 +1,26 @@ +# Minimal-but-realistic Frankie-style topic config for tests. +# Shape mirrors CHN-Bluesky-Govbot's topics/transportation/config.yml. +name: transportation +display_name: Transportation +default_emoji: "🚗" +keywords: + - public transit + - bus rapid transit + - light rail + - high-speed rail + - bike lane + - pedestrian safety + - vision zero + - electric vehicle + - EV charging + - road infrastructure +emoji_map: + rail: "🚆" + bus: "🚌" + bicycle: "🚲" +digest_title: "🗳️ Transportation Bills Weekly Digest" +topic: "transportation" +# Permissive — Frankie configs may carry extra fields the migration tool +# does not translate yet. They must parse without error. +schedule: weekly +timezone: America/Chicago diff --git a/actions/govbot/tests/from_frankie_config.rs b/actions/govbot/tests/from_frankie_config.rs new file mode 100644 index 00000000..3c121f05 --- /dev/null +++ b/actions/govbot/tests/from_frankie_config.rs @@ -0,0 +1,188 @@ +//! Integration test for `govbot init --from-frankie-config ` — the +//! 
migration tool that scaffolds a govbot+fastclass project from an existing +//! Frankie-style topic config. +//! +//! Asserts the produced skeleton: +//! - Has a valid govbot manifest (`govbot.yml` parses). +//! - Has a classifier bundle with exactly one tag named after the Frankie +//! `name`, whose `include_keywords` equal the fixture's keyword list. +//! - Has a `fusion.yml` declaring the portable `models:` block. +//! - Refuses to overwrite an existing project (idempotency guard). +//! +//! Mirrors the style of `run_repos_scope.rs` — builds the binary, runs it as +//! a subprocess, and inspects the on-disk output. + +use std::fs; +use std::path::PathBuf; +use std::process::Command; + +fn govbot_binary() -> PathBuf { + let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + let status = Command::new("cargo") + .args(["build", "--bin", "govbot"]) + .current_dir(&manifest_dir) + .status() + .expect("cargo build should succeed"); + assert!(status.success(), "cargo build failed"); + manifest_dir.join("target").join("debug").join("govbot") +} + +fn fixture_path() -> PathBuf { + PathBuf::from(env!("CARGO_MANIFEST_DIR")) + .join("tests") + .join("fixtures") + .join("frankie_transportation_config.yml") +} + +#[test] +fn from_frankie_config_scaffolds_a_valid_govbot_project() { + let bin = govbot_binary(); + let fixture = fixture_path(); + let tmp = tempfile::tempdir().expect("tempdir"); + let into = tmp.path().join("scratch-transport"); + + // --- Run: govbot init --from-frankie-config --into --- + let output = Command::new(&bin) + .args([ + "init", + "--from-frankie-config", + fixture.to_str().unwrap(), + "--into", + into.to_str().unwrap(), + ]) + .output() + .expect("govbot init should execute"); + assert!( + output.status.success(), + "govbot init failed: stdout={} stderr={}", + String::from_utf8_lossy(&output.stdout), + String::from_utf8_lossy(&output.stderr) + ); + + // --- 1. govbot.yml parses as a valid manifest. 
--- + let manifest_path = into.join("govbot.yml"); + assert!( + manifest_path.exists(), + "expected scaffolded govbot.yml at {}", + manifest_path.display() + ); + let manifest = + govbot::config::Manifest::load(&manifest_path).expect("scaffolded govbot.yml should parse"); + assert_eq!( + manifest.datasets, + vec!["all".to_string()], + "scaffolded manifest should default to datasets: [all]" + ); + assert!( + manifest.transforms.contains_key("classify"), + "manifest should declare a `classify` transform" + ); + assert!( + manifest.publish.contains_key("bluesky"), + "manifest should declare a `bluesky` publisher" + ); + // bluesky select carries the topic name (single tag = the topic). + let bluesky = manifest.publish.get("bluesky").unwrap(); + assert_eq!( + bluesky.select.as_deref().map(|s| s.to_vec()), + Some(vec!["transportation".to_string()]) + ); + + // --- 2. classifier.yml has one tag named after `name`; keywords match. --- + let classifier_yml_path = into.join("classifier").join("classifier.yml"); + assert!(classifier_yml_path.exists(), "classifier.yml should exist"); + let raw = fs::read_to_string(&classifier_yml_path).expect("read classifier.yml"); + let parsed: serde_yaml::Value = serde_yaml::from_str(&raw).expect("classifier.yml is YAML"); + let tags = parsed + .get("tags") + .and_then(|v| v.as_mapping()) + .expect("classifier.yml should carry a `tags:` mapping"); + assert_eq!( + tags.len(), + 1, + "scaffolded classifier should carry exactly one tag (the Frankie topic name)" + ); + let tag = tags + .get(serde_yaml::Value::String("transportation".to_string())) + .expect("the single tag should be named after the Frankie `name`"); + let include_keywords: Vec<String> = tag + .get("include_keywords") + .and_then(|v| v.as_sequence()) + .expect("tag should carry include_keywords") + .iter() + .map(|v| v.as_str().expect("keyword is a string").to_string()) + .collect(); + let expected_keywords = vec![ + "public transit", + "bus rapid transit", + "light rail", + "high-speed
rail", + "bike lane", + "pedestrian safety", + "vision zero", + "electric vehicle", + "EV charging", + "road infrastructure", + ]; + assert_eq!( + include_keywords, expected_keywords, + "include_keywords should mirror the Frankie keyword list verbatim" + ); + + // --- 3. fusion.yml declares the portable models: block. --- + let fusion_path = into.join("classifier").join("fusion.yml"); + assert!(fusion_path.exists(), "fusion.yml should exist"); + let fusion_raw = fs::read_to_string(&fusion_path).expect("read fusion.yml"); + let fusion: serde_yaml::Value = + serde_yaml::from_str(&fusion_raw).expect("fusion.yml should parse"); + let models = fusion + .get("models") + .and_then(|v| v.as_mapping()) + .expect("fusion.yml should declare a `models:` block"); + assert!( + models.contains_key(serde_yaml::Value::String("encoder".to_string())), + "models: should declare an encoder" + ); + assert!( + models.contains_key(serde_yaml::Value::String("reranker".to_string())), + "models: should declare a reranker" + ); + + // --- supporting files exist --- + assert!(into + .join("classifier") + .join("eval") + .join("constitution.yml") + .exists()); + assert!(into + .join("classifier") + .join("eval") + .join("rolling.yml") + .exists()); + assert!(into.join("classifier").join("proposals").exists()); + assert!(into.join("summarizer").join("prompt.md").exists()); + assert!(into.join("README.md").exists()); + assert!(into.join(".gitignore").exists()); + + // --- 4. Re-running into the same dir refuses to overwrite. 
--- + let rerun = Command::new(&bin) + .args([ + "init", + "--from-frankie-config", + fixture.to_str().unwrap(), + "--into", + into.to_str().unwrap(), + ]) + .output() + .expect("re-run should execute"); + assert!( + !rerun.status.success(), + "re-running --from-frankie-config into an existing project must fail" + ); + let stderr = String::from_utf8_lossy(&rerun.stderr); + assert!( + stderr.contains("already exists") || stderr.contains("refusing"), + "stderr should explain the overwrite guard; got: {}", + stderr + ); +} diff --git a/actions/govbot/tests/run_repos_scope.rs b/actions/govbot/tests/run_repos_scope.rs new file mode 100644 index 00000000..bb8360c2 --- /dev/null +++ b/actions/govbot/tests/run_repos_scope.rs @@ -0,0 +1,184 @@ +//! Regression test for the `datasets:[wy]` scope leak. +//! +//! `govbot::pipeline::run_transform_dag` spawns `govbot source --select docs` +//! as the head of the classify pipeline. Pre-fix it never passed `--repos`, +//! so the manifest's `datasets:` was silently ignored at the source step: a +//! manifest declaring `datasets: [wy]` in a project whose `.govbot/repos/` +//! held 52 datasets (left over from an earlier `[all]` pull) classified +//! ~4900 records across every state instead of ~100 Wyoming records. +//! +//! The fix translates `manifest.datasets` to a `--repos ` argv that +//! gets appended to the source spawn. This test pins the two invariants the +//! fix relies on: +//! +//! 1. `govbot source --select docs --repos ...` against a +//! multi-dataset cache emits records only from the named dataset(s). +//! This is the source-side scoping the pipeline relies on — if it ever +//! regresses, the pipeline's `--repos` plumbing is moot. +//! 2. Omitting `--repos` walks every linked dataset — the "every dataset" +//! sentinel `source_repos_from_manifest` produces for a `[all]` +//! manifest, so the pipeline can keep treating absence as "all". +//! +//! Together with the `source_repos_from_manifest` unit test in +//! 
`pipeline.rs` (which pins the manifest→argv translation), these +//! invariants regression-test the full fix path. + +use std::fs; +use std::path::PathBuf; +use std::process::Command; + +/// Path to the freshly-built `govbot` binary. Mirrors the helper in +/// `cli_example_snaps.rs` but kept local so the two integration test +/// binaries stay independent. +fn govbot_binary() -> PathBuf { + let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); + let status = Command::new("cargo") + .args(["build", "--bin", "govbot"]) + .current_dir(&manifest_dir) + .status() + .expect("cargo build should succeed"); + assert!(status.success(), "cargo build failed"); + manifest_dir.join("target").join("debug").join("govbot") +} + +/// Build a throwaway `.govbot/repos/` tree with two datasets (`wy`, `gu`), +/// each holding one bill with one log file. Returns the absolute path to the +/// `.govbot` root (the value to pass as `--govbot-dir`). +/// +/// We materialise the on-disk corpus by hand rather than re-using +/// `actions/govbot/mocks/` because the shipped mock's `gu-legislation/` has +/// metadata but no logs — `govbot source` emits per-log entries, so a +/// "did the filter scope to wy" assertion is vacuous if `gu` has no logs to +/// scope away. +fn build_two_dataset_corpus(tmp: &std::path::Path) -> PathBuf { + let repos = tmp.join(".govbot").join("repos"); + + for (dataset, state, bill_id) in [ + ("wy-legislation", "wy", "HB0001"), + ("gu-legislation", "gu", "B1-38"), + ] { + // Layout the walker expects: `country:/state:/sessions//bills//logs/_.json`. + let session = if state == "wy" { "2025" } else { "38th" }; + let bill_dir = repos + .join(dataset) + .join("country:us") + .join(format!("state:{}", state)) + .join("sessions") + .join(session) + .join("bills") + .join(bill_id); + let logs_dir = bill_dir.join("logs"); + fs::create_dir_all(&logs_dir).expect("create logs dir"); + + // A minimal metadata.json — `source --select docs` joins it for the + // doc text. 
The timestamp on the log filename is what the source + // walker sorts by; the suffix names the action. + fs::write( + bill_dir.join("metadata.json"), + serde_json::json!({ + "title": format!("Test bill {}", bill_id), + "identifier": bill_id, + "subjects": ["test"], + "abstracts": [{"abstract": format!("Body of {}", bill_id)}], + }) + .to_string(), + ) + .expect("write metadata.json"); + + // A "passage" log — substantive under `--filter default`, so the + // record survives the default filter and shows up in `--select docs`. + fs::write( + logs_dir.join("20250129T022703Z_passage.json"), + serde_json::json!({ + "action": "passage", + "bill_id": bill_id, + "date": "2025-01-29", + }) + .to_string(), + ) + .expect("write log"); + } + + tmp.join(".govbot") +} + +/// Collect `govbot source --select docs` stdout against the throwaway +/// corpus, parsed into one JSON value per non-empty line. The `--filter +/// none` keeps every log entry — we want the count to depend only on +/// `--repos` scoping, not on the per-dataset action filters. +fn source_docs(govbot_dir: &std::path::Path, repos: &[&str]) -> Vec<serde_json::Value> { + let bin = govbot_binary(); + let mut cmd = Command::new(&bin); + cmd.arg("source") + .arg("--select") + .arg("docs") + .arg("--filter") + .arg("none") + .arg("--govbot-dir") + .arg(govbot_dir); + if !repos.is_empty() { + cmd.arg("--repos"); + for r in repos { + cmd.arg(r); + } + } + let output = cmd.output().expect("spawn govbot source"); + assert!( + output.status.success(), + "govbot source exited with {:?}\nstderr:\n{}", + output.status.code(), + String::from_utf8_lossy(&output.stderr) + ); + String::from_utf8_lossy(&output.stdout) + .lines() + .filter(|l| !l.trim().is_empty()) + .filter_map(|l| serde_json::from_str::<serde_json::Value>(l).ok()) + .collect() +} + +/// Pin invariant (1): `--repos wy` against a `wy+gu` corpus emits only `wy` +/// records. This is the source-side guarantee the pipeline relies on.
+#[test] +fn source_with_repos_scopes_to_named_dataset() { + let tmp = tempfile::tempdir().expect("tmpdir"); + let govbot_dir = build_two_dataset_corpus(tmp.path()); + + let wy_only = source_docs(&govbot_dir, &["wy"]); + assert!( + !wy_only.is_empty(), + "wy corpus should emit at least one record" + ); + for record in &wy_only { + let id = record + .get("id") + .and_then(|v| v.as_str()) + .expect("doc record must have a string `id`"); + assert!( + id.starts_with("wy-legislation/"), + "--repos wy leaked a non-wy record: {}", + id + ); + } +} + +/// Pin invariant (2): omitting `--repos` walks every linked dataset. This is +/// what `source_repos_from_manifest(&["all"])` returns (empty list → flag +/// omitted), and the pipeline relies on that translation matching source's +/// own "all" sentinel. +#[test] +fn source_without_repos_walks_every_dataset() { + let tmp = tempfile::tempdir().expect("tmpdir"); + let govbot_dir = build_two_dataset_corpus(tmp.path()); + + let all = source_docs(&govbot_dir, &[]); + let datasets: std::collections::BTreeSet<String> = all + .iter() + .filter_map(|r| r.get("id").and_then(|v| v.as_str())) + .filter_map(|id| id.split('/').next().map(str::to_string)) + .collect(); + assert!( + datasets.contains("wy-legislation") && datasets.contains("gu-legislation"), + "no-`--repos` walk should hit both datasets, got: {:?}", + datasets + ); +} diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap index 773949e9..5a2a0cce 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_help.snap @@ -1,24 +1,33 @@ --- source: tests/cli_example_snaps.rs +assertion_line: 221 expression: "&formatted_stdout" --- Command: govbot --help Output: -Process pipeline log files with type-safe reactive streams +govbot — a 4-tool civic-data publishing stack.
(1) Select real gov data: pull the legislation of all 50 states, DC, territories, and federal Congress from a content-addressed dataset registry. (2) Filter/transform: run transforms over the stream — fastclass tagging today, local-LLM summarize on the roadmap. (3) Publish with receipts: RSS / HTML / JSON / DuckDB / Bluesky today, plus a roadmap GitHub Pages 'receipts' artifact that carries deterministic provenance behind every AI digest. (4) Coding-agent-native dev experience: AGENT.md walks Claude Code through make / manage / update of a project. Configured by a govbot.yml manifest (datasets / transforms / publish / pipelines). See AGENT.md for the end-user playbook, README for the honest gap map. Usage: govbot [COMMAND] Commands: - clone Clone or pull data pipeline repositories (default: updates existing repos) Clones if repository doesn't exist, pulls if it does Use "govbot clone all" to clone all repos, or "govbot clone " for specific repos - logs Process and display pipeline log files - delete Delete data pipeline repositories Deletes local repository directories for specified locales - load Load bill metadata into a DuckDB database file Loads all metadata.json files from cloned repos into a DuckDB database for analysis. The database file is saved in the base govbot directory (e.g., ./.govbot/govbot.duckdb) - update Update govbot to the latest nightly version Downloads and installs the latest nightly build from GitHub releases - build Build RSS feed and HTML index from govbot.yml configuration Generates a combined RSS feed and HTML index from logs filtered by tags in govbot.yml - tag Tag bills using semantic or built-in similarity based on govbot.yml in the current directory. Reads JSON lines from stdin (from `govbot logs`), processes entries with bill identifiers, and writes per-tag files under the directory containing govbot.yml. By default, acts as a filter: only outputs lines that match tags. 
If a tag name is provided, only processes and outputs lines matching that specific tag - help Print this message or the help of the given subcommand(s) + pull Pull (clone or update) dataset repositories into the shared cache and link them into the project. Use `govbot pull all` to pull every dataset, `govbot pull ...` for specific ones, or `govbot pull` with no args to refresh whatever's already linked into the project + source Stream dataset records as JSON Lines — the govbot stream-protocol `source` stage. Pipe into a transform (`fastclass classify -`) or into `govbot apply` for the persistence sink. See `schemas/STREAM_PROTOCOL.md` + delete Delete locally-linked dataset clones from the project's `.govbot/repos/`. Use `govbot delete all` to clear every linked dataset, or `govbot delete ...` for specific ones. The shared cache at `~/.govbot/cache/` is not touched — a subsequent `pull` re-links instantly + load Load bill metadata into a DuckDB database for SQL analysis. Walks every linked dataset's `metadata.json` files, creates a `bills` table + a `bills_summary` view, and writes the database into the base govbot directory (default `./.govbot/govbot.duckdb`). Requires the `duckdb` CLI on PATH + update Update the installed govbot binary to the latest nightly build from GitHub releases. Installs into `~/.govbot/bin/govbot` and prefers the platform-native `.tar.gz` asset + publish Run one or more publishers from `govbot.yml: publish:`. A publisher consumes the tagged result stream and emits artifacts: `rss`/`html`/`json` write feed/index/dump files, `duckdb` loads records into a database, `bluesky` posts matches to a Bluesky account (always dry-run first with `--dry-run`) + apply Persist fastclass classification results as tag files under the project's `tags/` output directory. 
Reads `fastclass classify` result JSON from stdin — the apply sink of `govbot source --select docs | fastclass classify - | govbot apply` — and writes per-tag `.tag.json` files under `/tags//country:.../sessions//`, the files `govbot publish` turns into feeds. Classification itself is done by fastclass; `govbot apply` only stores the results. `tags/` is a project-rooted classification-output dir — peer to `dist/` (publisher output) and distinct from `.govbot/` (the tool's regenerable cache) + run Run the full pipeline against the current directory's `govbot.yml`: pull/update datasets → `source --select docs | fastclass classify - | apply` (the classify transform) → publish every configured publisher. `govbot` with no arguments is equivalent (and falls back to `init` if no `govbot.yml` is present) + init Scaffold a new govbot.yml in the current directory (the setup wizard). Interactive in a TTY; writes sensible defaults when non-interactive + add Add one or more datasets to the project's `govbot.yml` `datasets:` list. Each id is validated against the registry before it is added + remove Remove one or more datasets from the project's `govbot.yml` + ls List datasets — the project's manifest datasets and the ones cached locally. With no manifest, lists every dataset in the registry + search Search the dataset registry. A blank query lists every dataset + doctor Check that the project's pulled datasets are coherent. A data-integrity smoke test, runnable after `govbot pull all` or before `govbot run` in production. Walks every linked dataset and verifies that the `govbot source --select docs` stream is well-formed: every linked dataset entry resolves to a real directory, per-dataset ids don't collapse onto a handful (the bug-7592418 signature), every sampled `id` resolves to a present and parseable `metadata.json`, and every sampled `text` is non-trivial. 
Zero-record datasets are surfaced as warnings rather than errors — `--filter default` can legitimately drop every routine log. Exits non-zero on any failure so it can drop straight into a CI step. Skips cleanly when the cache is empty — this is a smoke test, not a unit test + logs **Deprecated.** Alias for `govbot source` (default mode) preserved so existing consumers (the CHN-Bluesky-Govbot-Main framework, anyone running `govbot logs > bills.jsonl`) keep working after the Logs→Source rename. Prints a deprecation warning to stderr on invocation. Will be removed in a future major version + help Print this message or the help of the given subcommand(s) Options: -h, --help Print help diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_clone_list.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_pull_list.snap similarity index 80% rename from actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_clone_list.snap rename to actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_pull_list.snap index 6679a69a..0c6b783e 100644 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_clone_list.snap +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@govbot_pull_list.snap @@ -3,15 +3,17 @@ source: tests/cli_example_snaps.rs expression: "&formatted_stdout" --- Command: -govbot clone --list +govbot pull --list Output: -Available repos: +Available datasets: ak al ar + az ca co + ct de fl ga @@ -50,12 +52,14 @@ Available repos: sc sd tn + tx usa ut + va vi vt wa wi wv wy - all (clone all repos) + all (pull every dataset) diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@logs_basic.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@logs_basic.snap deleted file mode 100644 index 3802b2b7..00000000 --- a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@logs_basic.snap +++ /dev/null @@ -1,8 +0,0 @@ ---- -source: tests/cli_example_snaps.rs 
-expression: "&formatted_stdout" ---- -Command: -govbot logs - -Output: diff --git a/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@source_basic.snap b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@source_basic.snap new file mode 100644 index 00000000..9adc20f7 --- /dev/null +++ b/actions/govbot/tests/snapshots/cli_example_snaps__snapshot@source_basic.snap @@ -0,0 +1,14 @@ +--- +source: tests/cli_example_snaps.rs +assertion_line: 221 +expression: "&formatted_stdout" +--- +Command: +govbot source + +Output: +{"bill":{"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0001","legislative_session":"2025","title":"General government appropriations-2."},"id":"HB0001","log":{"action":{"classification":["filing"],"date":"2025-01-29T17:06:54+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0001"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0001/logs/20250129T170654Z_h_received_for_introduction.json"},"timestamp":"20250129T170654Z"} +{"bill":{"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0002","legislative_session":"2025","title":"Hunting license application fees increase."},"id":"HB0002","log":{"action":{"classification":["filing"],"date":"2025-01-02T19:16:16+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0002"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0002/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0002/logs/20250102T191616Z_h_received_for_introduction.json"},"timestamp":"20250102T191616Z"} +{"bill":{"abstracts":[{"abstract":"2025/Summaries/HB0005.pdf","note":"summary"}],"from_organization":"~{\"classification\": 
\"lower\"}","identifier":"HB0005","legislative_session":"2025","title":"Fishing outfitters and guides-registration of fishing boats."},"id":"HB0005","log":{"action":{"classification":["filing"],"date":"2025-01-02T19:18:48+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0005"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0005/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0005/logs/20250102T191848Z_h_received_for_introduction.json"},"timestamp":"20250102T191848Z"} +{"bill":{"abstracts":[{"abstract":"2025/Summaries/HB0004.pdf","note":"summary"}],"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0004","legislative_session":"2025","title":"Snowmobile registration and user fees."},"id":"HB0004","log":{"action":{"classification":["filing"],"date":"2025-01-02T19:17:44+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0004"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0004/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0004/logs/20250102T191744Z_h_received_for_introduction.json"},"timestamp":"20250102T191744Z"} +{"bill":{"from_organization":"~{\"classification\": \"lower\"}","identifier":"HB0003","legislative_session":"2025","title":"Animal abuse-predatory animals."},"id":"HB0003","log":{"action":{"classification":["filing"],"date":"2025-01-02T19:17:11+00:00","description":"H Received for Introduction","organization_id":"~{\"classification\": \"lower\"}"},"bill_id":"HB0003"},"sources":{"bill":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0003/metadata.json","log":"wy-legislation/country:us/state:wy/sessions/2025/bills/HB0003/logs/20250102T191711Z_h_received_for_introduction.json"},"timestamp":"20250102T191711Z"} diff --git 
a/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap new file mode 100644 index 00000000..d2b6d0f0 --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_all.snap @@ -0,0 +1,37 @@ +--- +source: tests/wizard_tests.rs +assertion_line: 58 +expression: "&yml" +--- +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - all + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). + classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://myuser.github.io/my-govbot" + output_dir: "docs" + output_file: "feed.xml" + site: + type: html + base_url: "https://myuser.github.io/my-govbot" + output_dir: "docs" + output_file: "index.html" + +pipelines: + default: + - classify + - feed + - site diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_all_no_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_all_no_tag.snap deleted file mode 100644 index a3ae97fb..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_all_no_tag.snap +++ /dev/null @@ -1,24 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&yml" ---- -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - all - -tags: - # Add your tags here. Example: - # my_topic: - # description: | - # Legislation related to ... 
- # examples: - # - "Example bill description" - {} - -build: - base_url: "https://example.com" - output_dir: "docs" - output_file: "feed.xml" diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_all_with_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_all_with_tag.snap deleted file mode 100644 index df8d77bf..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_all_with_tag.snap +++ /dev/null @@ -1,36 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&yml" ---- -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - all - -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid 
programs to include part-time students" - -build: - base_url: "https://myuser.github.io/my-govbot" - output_dir: "docs" - output_file: "feed.xml" diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap new file mode 100644 index 00000000..fa4c60af --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all.snap @@ -0,0 +1,97 @@ +--- +source: tests/wizard_tests.rs +assertion_line: 18 +expression: "&session.to_snapshot()" +--- +=== Wizard Session === + +Welcome to govbot! Let's set up your project. + +? What datasets do you want to track? +> All jurisdictions in the registry + Select specific datasets + +Classification is done by fastclass against a classifier bundle. +Point the manifest's `transforms.classify.classifier` at your +bundle directory (containing classifier.yml). See the fastclass +docs to build one. + +Publishing is configured for an RSS feed + HTML index by default. +Both land in the "docs" directory (feed.xml + index.html). + +? Base URL for your feeds: https://myuser.github.io/my-govbot + + ✓ Created govbot.yml + ✓ Created .gitignore + ✓ Created .github/workflows/build.yml + +Setup complete! Run 'govbot' again to start the pipeline. + +=== Generated: govbot.yml === + +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - all + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). 
+ classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://myuser.github.io/my-govbot" + output_dir: "docs" + output_file: "feed.xml" + site: + type: html + base_url: "https://myuser.github.io/my-govbot" + output_dir: "docs" + output_file: "index.html" + +pipelines: + default: + - classify + - feed + - site + +=== Generated: .github/workflows/build.yml === + +# Run Govbot +# Runs govbot to pull datasets, apply classifications, and publish feeds. + +name: Build Govbot + +on: + push: + branches: + - main + - master + schedule: + - cron: '0 0 * * *' + workflow_dispatch: + inputs: + limit: + description: 'Limit number of entries per artifact (default: 100, use "none" for all)' + required: false + type: string + +jobs: + govbot: + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Run Govbot + uses: chihacknight/govbot/actions/govbot@main + with: + limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_own_tags.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_own_tags.snap deleted file mode 100644 index 127602d2..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_own_tags.snap +++ /dev/null @@ -1,122 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&session.to_snapshot()" ---- -=== Wizard Session === - -Welcome to govbot! Let's set up your project. - -? What data sources do you want to track? -> All states (47 jurisdictions) - Select specific states - -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: - - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? 
- Use the example "education" tag to start -> I'll create my own tags later - -To create a tag, copy this prompt into your preferred AI tool: - ---- -Create a govbot tag definition in YAML for tracking [YOUR TOPIC] legislation. -The tag should have: -- A description (multiline, covering subtopics) -- 2-3 example bill descriptions that would match -- Optional: include_keywords and exclude_keywords lists - -Format: - tag_name: - description: | - ... - examples: - - "..." - include_keywords: - - keyword1 - exclude_keywords: - - keyword1 ---- - -Paste the result into your govbot.yml under the 'tags:' section. - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://example.com - - ✓ Created govbot.yml - ✓ Created .gitignore with .govbot - ✓ Created .github/workflows/build.yml - -Setup complete! Run 'govbot' again to start the pipeline. - -=== Generated: govbot.yml === - -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - all - -tags: - # Add your tags here. Example: - # my_topic: - # description: | - # Legislation related to ... - # examples: - # - "Example bill description" - {} - -build: - base_url: "https://example.com" - output_dir: "docs" - output_file: "feed.xml" - -=== Generated: .github/workflows/build.yml === - -# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. 
- -name: Build Govbot - -on: - push: - branches: - - main - - master - schedule: - - cron: '0 0 * * *' - workflow_dispatch: - inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string - limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' - required: false - type: string - -jobs: - govbot: - runs-on: ubuntu-latest - - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Run Govbot - uses: chihacknight/govbot/actions/govbot@main - with: - tags: ${{ inputs.tags }} - limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_with_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_with_tag.snap deleted file mode 100644 index b0d13d03..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_all_with_tag.snap +++ /dev/null @@ -1,111 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&session.to_snapshot()" ---- -=== Wizard Session === - -Welcome to govbot! Let's set up your project. - -? What data sources do you want to track? -> All states (47 jurisdictions) - Select specific states - -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: - - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? -> Use the example "education" tag to start - I'll create my own tags later - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://myuser.github.io/my-govbot - - ✓ Created govbot.yml - ✓ Created .gitignore with .govbot - ✓ Created .github/workflows/build.yml - -Setup complete! 
Run 'govbot' again to start the pipeline. - -=== Generated: govbot.yml === - -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - all - -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://myuser.github.io/my-govbot" - output_dir: "docs" - output_file: "feed.xml" - -=== Generated: .github/workflows/build.yml === - -# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. 
- -name: Build Govbot - -on: - push: - branches: - - main - - master - schedule: - - cron: '0 0 * * *' - workflow_dispatch: - inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string - limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' - required: false - type: string - -jobs: - govbot: - runs-on: ubuntu-latest - - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Run Govbot - uses: chihacknight/govbot/actions/govbot@main - with: - tags: ${{ inputs.tags }} - limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap index e5413262..c1d95854 100644 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_single_state.snap @@ -1,90 +1,75 @@ --- source: tests/wizard_tests.rs +assertion_line: 44 expression: "&session.to_snapshot()" --- === Wizard Session === Welcome to govbot! Let's set up your project. -? What data sources do you want to track? - All states (47 jurisdictions) -> Select specific states +? What datasets do you want to track? + All jurisdictions in the registry +> Select specific datasets -Available states/jurisdictions: - ak, al, ar, ca, co, de, fl, ga, gu, hi - ia, id, il, in, ks, ky, la, ma, md, me - mi, mn, mo, mp, ms, mt, nc, nd, ne, nh - nj, nm, nv, ny, oh, ok, or, pa, pr, ri - sc, sd, tn, usa, ut, vi, vt, wa, wi, wv - wy +Browse the registry with `govbot search`. -? Enter state codes separated by spaces: wy +? Enter dataset ids separated by spaces: wy -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: +Classification is done by fastclass against a classifier bundle. 
+Point the manifest's `transforms.classify.classifier` at your +bundle directory (containing classifier.yml). See the fastclass +docs to build one. - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? -> Use the example "education" tag to start - I'll create my own tags later - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. +Publishing is configured for an RSS feed + HTML index by default. +Both land in the "docs" directory (feed.xml + index.html). ? Base URL for your feeds: https://sartaj.me/govbot ✓ Created govbot.yml - ✓ Created .gitignore with .govbot + ✓ Created .gitignore ✓ Created .github/workflows/build.yml Setup complete! Run 'govbot' again to start the pipeline. === Generated: govbot.yml === -# Govbot Configuration +# Govbot Manifest # Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json $schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -repos: +datasets: - wy -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - 
Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://sartaj.me/govbot" - output_dir: "docs" - output_file: "feed.xml" +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). + classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://sartaj.me/govbot" + output_dir: "docs" + output_file: "feed.xml" + site: + type: html + base_url: "https://sartaj.me/govbot" + output_dir: "docs" + output_file: "index.html" + +pipelines: + default: + - classify + - feed + - site === Generated: .github/workflows/build.yml === # Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. +# Runs govbot to pull datasets, apply classifications, and publish feeds. 
name: Build Govbot @@ -97,12 +82,8 @@ on: - cron: '0 0 * * *' workflow_dispatch: inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' + description: 'Limit number of entries per artifact (default: 100, use "none" for all)' required: false type: string @@ -117,5 +98,4 @@ jobs: - name: Run Govbot uses: chihacknight/govbot/actions/govbot@main with: - tags: ${{ inputs.tags }} limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap new file mode 100644 index 00000000..013135da --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific.snap @@ -0,0 +1,103 @@ +--- +source: tests/wizard_tests.rs +assertion_line: 31 +expression: "&session.to_snapshot()" +--- +=== Wizard Session === + +Welcome to govbot! Let's set up your project. + +? What datasets do you want to track? + All jurisdictions in the registry +> Select specific datasets + +Browse the registry with `govbot search`. + +? Enter dataset ids separated by spaces: il ca ny + +Classification is done by fastclass against a classifier bundle. +Point the manifest's `transforms.classify.classifier` at your +bundle directory (containing classifier.yml). See the fastclass +docs to build one. + +Publishing is configured for an RSS feed + HTML index by default. +Both land in the "docs" directory (feed.xml + index.html). + +? Base URL for your feeds: https://activist.github.io/legislation + + ✓ Created govbot.yml + ✓ Created .gitignore + ✓ Created .github/workflows/build.yml + +Setup complete! Run 'govbot' again to start the pipeline. 
+ +=== Generated: govbot.yml === + +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - il + - ca + - ny + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). + classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://activist.github.io/legislation" + output_dir: "docs" + output_file: "feed.xml" + site: + type: html + base_url: "https://activist.github.io/legislation" + output_dir: "docs" + output_file: "index.html" + +pipelines: + default: + - classify + - feed + - site + +=== Generated: .github/workflows/build.yml === + +# Run Govbot +# Runs govbot to pull datasets, apply classifications, and publish feeds. + +name: Build Govbot + +on: + push: + branches: + - main + - master + schedule: + - cron: '0 0 * * *' + workflow_dispatch: + inputs: + limit: + description: 'Limit number of entries per artifact (default: 100, use "none" for all)' + required: false + type: string + +jobs: + govbot: + runs-on: ubuntu-latest + + steps: + - name: Checkout repository + uses: actions/checkout@v4 + + - name: Run Govbot + uses: chihacknight/govbot/actions/govbot@main + with: + limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_own_tags.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_own_tags.snap deleted file mode 100644 index 727838b5..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_own_tags.snap +++ /dev/null @@ -1,134 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&session.to_snapshot()" ---- -=== Wizard Session === - -Welcome to govbot! Let's set up your project. - -? What data sources do you want to track? 
- All states (47 jurisdictions) -> Select specific states - -Available states/jurisdictions: - ak, al, ar, ca, co, de, fl, ga, gu, hi - ia, id, il, in, ks, ky, la, ma, md, me - mi, mn, mo, mp, ms, mt, nc, nd, ne, nh - nj, nm, nv, ny, oh, ok, or, pa, pr, ri - sc, sd, tn, usa, ut, vi, vt, wa, wi, wv - wy - -? Enter state codes separated by spaces: il ca ny - -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: - - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? - Use the example "education" tag to start -> I'll create my own tags later - -To create a tag, copy this prompt into your preferred AI tool: - ---- -Create a govbot tag definition in YAML for tracking [YOUR TOPIC] legislation. -The tag should have: -- A description (multiline, covering subtopics) -- 2-3 example bill descriptions that would match -- Optional: include_keywords and exclude_keywords lists - -Format: - tag_name: - description: | - ... - examples: - - "..." - include_keywords: - - keyword1 - exclude_keywords: - - keyword1 ---- - -Paste the result into your govbot.yml under the 'tags:' section. - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://example.com - - ✓ Created govbot.yml - ✓ Created .gitignore with .govbot - ✓ Created .github/workflows/build.yml - -Setup complete! Run 'govbot' again to start the pipeline. 
- -=== Generated: govbot.yml === - -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - il - - ca - - ny - -tags: - # Add your tags here. Example: - # my_topic: - # description: | - # Legislation related to ... - # examples: - # - "Example bill description" - {} - -build: - base_url: "https://example.com" - output_dir: "docs" - output_file: "feed.xml" - -=== Generated: .github/workflows/build.yml === - -# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. - -name: Build Govbot - -on: - push: - branches: - - main - - master - schedule: - - cron: '0 0 * * *' - workflow_dispatch: - inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string - limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' - required: false - type: string - -jobs: - govbot: - runs-on: ubuntu-latest - - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Run Govbot - uses: chihacknight/govbot/actions/govbot@main - with: - tags: ${{ inputs.tags }} - limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_with_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_with_tag.snap deleted file mode 100644 index 528b4933..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_session_specific_with_tag.snap +++ /dev/null @@ -1,123 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&session.to_snapshot()" ---- -=== Wizard Session === - -Welcome to govbot! Let's set up your project. - -? What data sources do you want to track? 
- All states (47 jurisdictions) -> Select specific states - -Available states/jurisdictions: - ak, al, ar, ca, co, de, fl, ga, gu, hi - ia, id, il, in, ks, ky, la, ma, md, me - mi, mn, mo, mp, ms, mt, nc, nd, ne, nh - nj, nm, nv, ny, oh, ok, or, pa, pr, ri - sc, sd, tn, usa, ut, vi, vt, wa, wi, wv - wy - -? Enter state codes separated by spaces: il ca ny - -Tags let govbot categorize legislation by topics you care about. -Here's an example tag definition: - - education: - description: | - Legislation related to schools, education funding, - curriculum standards, and educational policy. - examples: - - "Increases per-pupil funding for public schools" - - "Mandates comprehensive sex education curriculum" - -? How would you like to set up tags? -> Use the example "education" tag to start - I'll create my own tags later - -Publishing is configured for RSS feeds by default. -Your feeds will be generated in the "docs" directory. - -? Base URL for your feeds: https://activist.github.io/legislation - - ✓ Created govbot.yml - ✓ Created .gitignore with .govbot - ✓ Created .github/workflows/build.yml - -Setup complete! Run 'govbot' again to start the pipeline. 
- -=== Generated: govbot.yml === - -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - il - - ca - - ny - -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://activist.github.io/legislation" - output_dir: "docs" - output_file: "feed.xml" - -=== Generated: .github/workflows/build.yml === - -# Run Govbot -# Runs govbot to clone repos, tag bills, and build RSS feeds and HTML index. 
- -name: Build Govbot - -on: - push: - branches: - - main - - master - schedule: - - cron: '0 0 * * *' - workflow_dispatch: - inputs: - tags: - description: 'Comma-separated list of tags to include (leave empty for all tags)' - required: false - type: string - limit: - description: 'Limit number of entries per feed (default: 15, use "none" for all)' - required: false - type: string - -jobs: - govbot: - runs-on: ubuntu-latest - - steps: - - name: Checkout repository - uses: actions/checkout@v4 - - - name: Run Govbot - uses: chihacknight/govbot/actions/govbot@main - with: - tags: ${{ inputs.tags }} - limit: ${{ inputs.limit }} diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap new file mode 100644 index 00000000..c504f67e --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_single.snap @@ -0,0 +1,37 @@ +--- +source: tests/wizard_tests.rs +assertion_line: 81 +expression: "&yml" +--- +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - wy + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). 
+ classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://sartaj.me/govbot" + output_dir: "docs" + output_file: "feed.xml" + site: + type: html + base_url: "https://sartaj.me/govbot" + output_dir: "docs" + output_file: "index.html" + +pipelines: + default: + - classify + - feed + - site diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_single_with_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_single_with_tag.snap deleted file mode 100644 index 3e513413..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_single_with_tag.snap +++ /dev/null @@ -1,36 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&yml" ---- -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - wy - -tags: - education: - description: | - Legislation related to schools, education funding, curriculum standards, and educational policy, including: - - K-12 public school funding, budgets, and resource allocation - - Curriculum standards, content requirements, and academic programs - - Teacher certification, training, professional development, and compensation - - Higher education policy, tuition, financial aid, and student loans - - Charter schools, school choice, vouchers, and alternative education models - - Special education services, accommodations, and individualized education plans - - School safety, security measures, and student discipline policies - - Early childhood education, pre-K programs, and childcare - - Standardized testing, assessments, and accountability measures - - School district governance, administration, and oversight - - Educational technology, digital learning, and online education - - Career and technical education, vocational training, and workforce development - examples: - - "Increases per-pupil funding for public schools 
and establishes minimum teacher salary requirements" - - "Mandates comprehensive sex education curriculum in all public schools" - - "Expands eligibility for state financial aid programs to include part-time students" - -build: - base_url: "https://sartaj.me/govbot" - output_dir: "docs" - output_file: "feed.xml" diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap new file mode 100644 index 00000000..1ae3458c --- /dev/null +++ b/actions/govbot/tests/snapshots/wizard_tests__wizard_specific.snap @@ -0,0 +1,39 @@ +--- +source: tests/wizard_tests.rs +assertion_line: 71 +expression: "&yml" +--- +# Govbot Manifest +# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json + +datasets: + - il + - ca + - ny + +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + # Path to your fastclass classifier bundle (containing classifier.yml). 
+ classifier: ./classifier + +publish: + feed: + type: rss + base_url: "https://example.com" + output_dir: "docs" + output_file: "feed.xml" + site: + type: html + base_url: "https://example.com" + output_dir: "docs" + output_file: "index.html" + +pipelines: + default: + - classify + - feed + - site diff --git a/actions/govbot/tests/snapshots/wizard_tests__wizard_specific_no_tag.snap b/actions/govbot/tests/snapshots/wizard_tests__wizard_specific_no_tag.snap deleted file mode 100644 index f3ab59ea..00000000 --- a/actions/govbot/tests/snapshots/wizard_tests__wizard_specific_no_tag.snap +++ /dev/null @@ -1,26 +0,0 @@ ---- -source: tests/wizard_tests.rs -expression: "&yml" ---- -# Govbot Configuration -# Schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json - -repos: - - il - - ca - - ny - -tags: - # Add your tags here. Example: - # my_topic: - # description: | - # Legislation related to ... 
- # examples: - # - "Example bill description" - {} - -build: - base_url: "https://example.com" - output_dir: "docs" - output_file: "feed.xml" diff --git a/actions/govbot/tests/wizard_tests.rs b/actions/govbot/tests/wizard_tests.rs index e8f30725..0ac38cdb 100644 --- a/actions/govbot/tests/wizard_tests.rs +++ b/actions/govbot/tests/wizard_tests.rs @@ -1,5 +1,5 @@ +use govbot::config::Manifest; use govbot::wizard::{generate_govbot_yml, WizardChoices, WizardSession}; -use govbot::publish::{load_config, get_repos_from_config}; // ============================================================ // Full wizard session snapshots — shows the entire user experience @@ -7,66 +7,35 @@ use govbot::publish::{load_config, get_repos_from_config}; // ============================================================ #[test] -fn wizard_session_all_repos_with_example_tag() { +fn wizard_session_all_datasets() { let session = WizardSession::from_choices(&WizardChoices { - repos: vec!["all".to_string()], - include_example_tag: true, + datasets: vec!["all".to_string()], base_url: "https://myuser.github.io/my-govbot".to_string(), }); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_session_all_with_tag", &session.to_snapshot()); + insta::assert_snapshot!("wizard_session_all", &session.to_snapshot()); }); } #[test] -fn wizard_session_all_repos_own_tags() { +fn wizard_session_specific_datasets() { let session = WizardSession::from_choices(&WizardChoices { - repos: vec!["all".to_string()], - include_example_tag: false, - base_url: "https://example.com".to_string(), - }); - let mut settings = insta::Settings::clone_current(); - settings.set_snapshot_path("snapshots"); - settings.bind(|| { - insta::assert_snapshot!("wizard_session_all_own_tags", &session.to_snapshot()); - }); -} - -#[test] -fn wizard_session_specific_repos_with_example_tag() { - let session = WizardSession::from_choices(&WizardChoices { - 
repos: vec!["il".to_string(), "ca".to_string(), "ny".to_string()], - include_example_tag: true, + datasets: vec!["il".to_string(), "ca".to_string(), "ny".to_string()], base_url: "https://activist.github.io/legislation".to_string(), }); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_session_specific_with_tag", &session.to_snapshot()); - }); -} - -#[test] -fn wizard_session_specific_repos_own_tags() { - let session = WizardSession::from_choices(&WizardChoices { - repos: vec!["il".to_string(), "ca".to_string(), "ny".to_string()], - include_example_tag: false, - base_url: "https://example.com".to_string(), - }); - let mut settings = insta::Settings::clone_current(); - settings.set_snapshot_path("snapshots"); - settings.bind(|| { - insta::assert_snapshot!("wizard_session_specific_own_tags", &session.to_snapshot()); + insta::assert_snapshot!("wizard_session_specific", &session.to_snapshot()); }); } #[test] fn wizard_session_single_state() { let session = WizardSession::from_choices(&WizardChoices { - repos: vec!["wy".to_string()], - include_example_tag: true, + datasets: vec!["wy".to_string()], base_url: "https://sartaj.me/govbot".to_string(), }); let mut settings = insta::Settings::clone_current(); @@ -81,137 +50,136 @@ fn wizard_session_single_state() { // ============================================================ #[test] -fn test_generate_govbot_yml_all_repos_with_example_tag() { - let yml = generate_govbot_yml(&["all".to_string()], true, "https://myuser.github.io/my-govbot"); +fn test_generate_govbot_yml_all_datasets() { + let yml = generate_govbot_yml(&["all".to_string()], "https://myuser.github.io/my-govbot"); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_all_with_tag", &yml); + insta::assert_snapshot!("wizard_all", &yml); }); } #[test] -fn 
test_generate_govbot_yml_specific_repos_no_tag() { +fn test_generate_govbot_yml_specific_datasets() { let yml = generate_govbot_yml( &["il".to_string(), "ca".to_string(), "ny".to_string()], - false, "https://example.com", ); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_specific_no_tag", &yml); + insta::assert_snapshot!("wizard_specific", &yml); }); } #[test] -fn test_generate_govbot_yml_all_repos_no_tag() { - let yml = generate_govbot_yml(&["all".to_string()], false, "https://example.com"); +fn test_generate_govbot_yml_single_dataset() { + let yml = generate_govbot_yml(&["wy".to_string()], "https://sartaj.me/govbot"); let mut settings = insta::Settings::clone_current(); settings.set_snapshot_path("snapshots"); settings.bind(|| { - insta::assert_snapshot!("wizard_all_no_tag", &yml); - }); -} - -#[test] -fn test_generate_govbot_yml_single_repo_with_tag() { - let yml = generate_govbot_yml(&["wy".to_string()], true, "https://sartaj.me/govbot"); - let mut settings = insta::Settings::clone_current(); - settings.set_snapshot_path("snapshots"); - settings.bind(|| { - insta::assert_snapshot!("wizard_single_with_tag", &yml); + insta::assert_snapshot!("wizard_single", &yml); }); } // ============================================================ -// Round-trip tests — generate YAML, write to disk, parse back, -// and verify the parsed config has the expected structure +// Round-trip tests — generate YAML, write to disk, parse back +// as a typed Manifest, and verify the parsed manifest structure // ============================================================ #[test] -fn test_generated_yml_is_valid_yaml_with_tag() { - let yml = generate_govbot_yml(&["all".to_string()], true, "https://myuser.github.io/my-govbot"); +fn test_generated_yml_is_valid_manifest() { + let yml = generate_govbot_yml(&["all".to_string()], "https://myuser.github.io/my-govbot"); let dir = 
tempfile::tempdir().unwrap(); let config_path = dir.path().join("govbot.yml"); std::fs::write(&config_path, &yml).unwrap(); - let config = load_config(&config_path).expect("generated govbot.yml should be valid YAML"); - - // Verify repos - let repos = get_repos_from_config(&config); - assert_eq!(repos, vec!["all"]); - - // Verify tags exist and have expected structure - let tags = config.get("tags").expect("should have tags key"); - let tags_obj = tags.as_object().expect("tags should be an object"); - assert!(tags_obj.contains_key("education"), "should contain education tag"); - let education = tags_obj.get("education").unwrap().as_object().unwrap(); - assert!(education.contains_key("description"), "education tag should have description"); - assert!(education.contains_key("examples"), "education tag should have examples"); - - // Verify build config - let build = config.get("build").expect("should have build key"); - let build_obj = build.as_object().expect("build should be an object"); - assert_eq!(build_obj.get("base_url").unwrap().as_str().unwrap(), "https://myuser.github.io/my-govbot"); - assert_eq!(build_obj.get("output_dir").unwrap().as_str().unwrap(), "docs"); - assert_eq!(build_obj.get("output_file").unwrap().as_str().unwrap(), "feed.xml"); + let manifest = Manifest::load(&config_path).expect("generated govbot.yml should parse"); + + // datasets + assert_eq!(manifest.datasets, vec!["all"]); + + // transforms — the classify transform shells out to fastclass. + let classify = manifest + .transforms + .get("classify") + .expect("should have a classify transform"); + assert_eq!(classify.reads, "docs"); + assert_eq!(classify.writes, "classification"); + assert!( + classify.classifier.is_some(), + "classify should reference a bundle" + ); + + // publish — the RSS feed publisher. 
+ let feed = manifest + .publish + .get("feed") + .expect("should have a feed publisher"); + assert_eq!( + feed.base_url.as_deref(), + Some("https://myuser.github.io/my-govbot") + ); + + // pipelines + assert!(manifest.pipelines.contains_key("default")); } #[test] -fn test_generated_yml_is_valid_yaml_without_tag() { - let yml = generate_govbot_yml( - &["il".to_string(), "ca".to_string()], - false, - "https://example.com", - ); +fn test_generated_yml_specific_datasets_round_trip() { + let yml = generate_govbot_yml(&["il".to_string(), "ca".to_string()], "https://example.com"); let dir = tempfile::tempdir().unwrap(); let config_path = dir.path().join("govbot.yml"); std::fs::write(&config_path, &yml).unwrap(); - let config = load_config(&config_path).expect("generated govbot.yml should be valid YAML"); - - // Verify repos - let repos = get_repos_from_config(&config); - assert_eq!(repos, vec!["il", "ca"]); + let manifest = Manifest::load(&config_path).expect("generated govbot.yml should parse"); + assert_eq!(manifest.datasets, vec!["il", "ca"]); +} - // Verify tags is empty object - let tags = config.get("tags").expect("should have tags key"); - let tags_obj = tags.as_object().expect("tags should be an object"); - assert!(tags_obj.is_empty(), "tags should be empty when no example tag"); +/// A manifest carrying the retired `tags:` block must fail to parse. 
+#[test] +fn test_manifest_with_tags_block_fails() { + let yml = "datasets:\n - all\ntags:\n education:\n description: x\n"; + let dir = tempfile::tempdir().unwrap(); + let config_path = dir.path().join("govbot.yml"); + std::fs::write(&config_path, yml).unwrap(); - // Verify build config - let build = config.get("build").expect("should have build key"); - let build_obj = build.as_object().expect("build should be an object"); - assert_eq!(build_obj.get("base_url").unwrap().as_str().unwrap(), "https://example.com"); + let result = Manifest::load(&config_path); + assert!( + result.is_err(), + "a govbot.yml containing `tags:` must fail to parse" + ); } #[test] fn test_write_files_creates_govbot_yml() { let choices = WizardChoices { - repos: vec!["wy".to_string()], - include_example_tag: true, + datasets: vec!["wy".to_string()], base_url: "https://sartaj.me/govbot".to_string(), }; let session = WizardSession::from_choices(&choices); let dir = tempfile::tempdir().unwrap(); - session.write_files(dir.path()).expect("write_files should succeed"); + session + .write_files(dir.path()) + .expect("write_files should succeed"); - // Verify govbot.yml was created and is parseable + // Verify govbot.yml was created and parses as a Manifest. let config_path = dir.path().join("govbot.yml"); assert!(config_path.exists(), "govbot.yml should exist"); - let config = load_config(&config_path).expect("written govbot.yml should be valid YAML"); - let repos = get_repos_from_config(&config); - assert_eq!(repos, vec!["wy"]); + let manifest = Manifest::load(&config_path).expect("written govbot.yml should parse"); + assert_eq!(manifest.datasets, vec!["wy"]); - // Verify .gitignore was created + // Verify .gitignore was created. 
let gitignore_path = dir.path().join(".gitignore"); assert!(gitignore_path.exists(), ".gitignore should exist"); let gitignore = std::fs::read_to_string(&gitignore_path).unwrap(); - assert!(gitignore.contains(".govbot"), ".gitignore should contain .govbot"); + assert!( + gitignore.contains(".govbot"), + ".gitignore should contain .govbot" + ); - // Verify workflow was created + // Verify workflow was created. let workflow_path = dir.path().join(".github/workflows/build.yml"); assert!(workflow_path.exists(), "build.yml workflow should exist"); } diff --git a/actions/pipeline-manager/AGENT.md b/actions/pipeline-manager/AGENT.md new file mode 100644 index 00000000..8f2fc2d2 --- /dev/null +++ b/actions/pipeline-manager/AGENT.md @@ -0,0 +1,131 @@ +# pipeline-manager — agent playbook + +Read this before editing anything in `actions/pipeline-manager/`. It is the playbook for the **data catalog layer**: the declarative YAMLs + Python orchestration that ship workflow code to the per-jurisdiction repos which actually produce govbot's legislative data. + +If you're a human, the root `AGENT.md` is the right entrypoint — this file assumes you're already oriented to the four-tool govbot stack. + +## 1. Mental model + +This directory is **a declarative repo factory**, not a runtime scraper. + +It reads two YAML catalogs (`chn-openstates-scrape.yml`, `chn-openstates-files.yml`), renders workflow templates per locale into `generated/`, and reconciles the resulting set of per-state repos against a GitHub org — create missing, update drifted, delete orphans. + +The actual scraping/formatting happens **inside the generated GitHub Actions workflows in those per-state repos**. This directory doesn't run scrapers; it ships workflow YAML to repos that do. + +Two orgs are in play: + +- **`chn-openstates-scrapers`** — raw OpenStates output, one repo per jurisdiction. Driven by `chn-openstates-scrape.yml`. +- **`chn-openstates-files`** (a.k.a. 
`govbot-data` post-rename) — OCD-formatted data, one repo per jurisdiction. Driven by `chn-openstates-files.yml`. Triggered from the scraper repo via `repository_dispatch: scrape-and-format-complete`. + +The `chn-openstates-files` org is what `govbot pull` actually reads — it's the user-visible side. + +## 2. The two configs, side-by-side + +| File | What it manages | Per-locale knobs | +|---|---|---| +| `chn-openstates-scrape.yml` | Scraper repos (OpenStates → raw output) | `template`, `toolkit_branch`, `name`, `disabled_jobs`, `labels` | +| `chn-openstates-files.yml` | Formatter repos (raw → OCD `metadata.json` + logs) | same | + +The `labels: [working]` flag on `chn-openstates-files.yml` is what **gates user-visible publication**. A jurisdiction missing that label ships empty/broken data even if the entry exists. As of 2026-05 the gaps are AZ, CT, TX, VA on the files side (chihacknight/govbot#33) and ~19 jurisdictions on the scrape side. + +`disabled_jobs:` is a list of workflow filenames (without extension) to skip rendering — most locales disable `extract-text` because text extraction isn't wired up yet. + +## 3. Python orchestration — read these in this order + +For any change beyond editing a single YAML line: + +1. **`render.py`** — parses YAML, walks locales, does sed-style `✏️{ var }✏️` substitution into `generated///...`. Filter flags: `--all-states`, `--test-states ak,wy`. Defaults to a 5-state sample (`al,ak,de,wy,sd`) when neither is set. + +2. **`apply.py`** — orchestrator. Key sections: + - `get_expected_repos` (lines 59–161): shells out to `render.py`, walks `generated/`, builds the set of repos that *should* exist. + - `get_actual_repos` (lines 164–187): `gh repo list `. + - `create_repo` / `update_repo` / `delete_repo` (lines 190–484): reconcile. + - `fully_override_dirs` (default `[".github"]`): which dirs in the target repo get authoritative overwrite — files there that aren't in the template get deleted. 
Other dirs are additive-merge, preserving user/data files. + +3. **`config.schema.json`** — JSON Schema validating both YAMLs. Read this before adding a new locale knob. + +## 4. How to make the common changes + +### A. Add a new state/territory + +1. Add a `locales.:` entry to **both** `chn-openstates-scrape.yml` and `chn-openstates-files.yml`. Crib from a neighbor. +2. Add the matching entry to `actions/govbot/data/registry.json` (path relative to the repo root) under `us-legislation/`. **This step is the easy one to forget — see §6.** +3. Run `./render-snapshots.sh` only if the new locale code is in the snapshot sample (`ak,id,mt,pr,wy`). Otherwise there is no snapshot churn. +4. Verify: `python3 apply.py -c chn-openstates-files.yml --test-states --dry-run`. + +### B. Mark a stuck jurisdiction as working / not-working (issue-#33 shape) + +1. Add or remove `labels: [working]` on the locale entry in `chn-openstates-files.yml`. +2. The fix doesn't live here — diagnose the underlying scraper/formatter failure by inspecting the per-state repo's **Actions tab on GitHub** (e.g. `https://github.com/chn-openstates-files/az-legislation/actions`). This directory has zero runtime logs. +3. No snapshot change needed — `labels` is metadata and doesn't flow into rendered workflows. + +### C. Change the workflow template (affects every jurisdiction) + +1. Edit `templates/openstates-to-ocd-files/.github/workflows/format.yml` (files side) or `templates/openstates-scrape/...` (scrape side). +2. Run `./render-snapshots.sh` and commit the snapshot diff. **The diff in `__snapshots__/` is the review surface** — without it, reviewers can't see what 55 repos are about to receive. +3. Then run `python3 apply.py -c .yml --all-states --dry-run` to see how many repos would receive the update. + +### D. Add a new dataset family that isn't OpenStates (Councilmatic — #30, Executive Actions — #28) + +1. New template dir: `templates//` containing the workflow YAML the per-locale repos should carry. +2. 
New top-level config YAML next to the existing two, registering `template_markers`, `org`, `templates`, `locales`. +3. Wire the `templates:` block + `folder-name:` pattern. `apply.py` is family-agnostic — no Python changes needed. +4. Add the resulting dataset IDs to `actions/govbot/data/registry.json`. If the namespace isn't `us-legislation`, set the right one (e.g. `us-executive`, `chicago-council`). + +## 5. Verification loop + +Always run this loop before any `gh repo create / update / delete`: + +```bash +cd actions/pipeline-manager + +# 1. Render only — never hits GitHub: +python3 render.py -c chn-openstates-files.yml --test-states + +# 2. Reconcile preview — calls `gh repo list` but no mutations: +python3 apply.py -c chn-openstates-files.yml --test-states --dry-run + +# 3. Snapshot regen (only if a sample-set state changed OR a template changed): +./render-snapshots.sh +``` + +**Footgun:** watch the `to delete: N` line in the dry-run summary. If N is unexpectedly large, do NOT run without `--no-delete`. Some repos in the org may exist intentionally outside the catalog (e.g. issue-#32's proposed per-session repos would land that way). + +## 6. The cross-tool sync gotcha — call this out loudly + +The **Python catalog and the Rust registry are independent sources of truth** and they drift. + +- **Python side** (`chn-openstates-{scrape,files}.yml`) sets `org.username: chn-openstates-files`. This is the org in which repos get created. +- **Rust side** (`actions/govbot/data/registry.json`, relative to the repo root) is hand-maintained, baked into the binary via `include_str!`, and as of 2026-05 still points every `git_url` at `chn-openstates-files/` — the **predecessor** of the `govbot-data` org. Issue chihacknight/govbot#32 flags this. + +If/when the `chn-openstates-files` → `govbot-data` org rename completes, **both** must move together. Touching only one creates a "user follows AGENT.md, lands on stale org" failure. 
+ +The same drift risk applies to every add/remove: Python adds the workflow repo, and Rust needs the registry entry pointing at where the data actually lands. + +**Rule:** never change one without checking the other. One grep is enough: + +```bash +rg "chn-openstates-files|govbot-data" actions/ +``` + +## 7. Where this layer stops + +What this directory does **not** own (don't drift into these): + +- The actual scraper code — lives in the generated per-state repos and the upstream `openstates/openstates-scrapers` project. +- The OCD format conversion — lives in `actions/format/` and is invoked from the generated `format.yml` workflow. +- Text extraction (issue #31) — would be a new workflow step calling something under `actions/extract/`; this directory would only add the workflow-template wiring. +- The `govbot pull` cache, stream protocol, and `--select docs` projection — owned by `actions/govbot/` (Rust). + +## Critical files + +Read these first when in doubt: + +- `chn-openstates-files.yml` — the files-side catalog +- `chn-openstates-scrape.yml` — the scrape-side catalog +- `apply.py` (lines 59–161, 307–484) — orchestration + update logic +- `render.py` — template rendering +- `config.schema.json` — locale schema +- `render-snapshots.sh` — the 5-state sample, deterministic across platforms +- `templates/openstates-to-ocd-files/.github/workflows/format.yml` — the workflow template every files-side jurisdiction receives +- `actions/govbot/data/registry.json` — the Rust-side sync target (§6) diff --git a/schemas/README.md b/schemas/README.md index 21841a76..fb2520fa 100644 --- a/schemas/README.md +++ b/schemas/README.md @@ -8,7 +8,7 @@ The `*.schema.json` files are JSON Schema definitions that validate the structur ### Configuration Schemas -- **`govbot.schema.json`** - Schema for `govbot.yml` configuration files used by the govbot CLI tool. Defines the structure for repositories, tags, and RSS publishing configuration. 
+- **`govbot.schema.json`** - Schema for `govbot.yml` manifest files used by the govbot CLI tool. A `govbot.yml` is a project manifest: it declares the `datasets` a project consumes, the `transforms` it runs over them, the `publish` publishers that emit artifacts, and named `pipelines` that wire those stages together. It is **not** a classifier — the tag taxonomy lives in a separate fastclass classifier bundle (`classifier.yml`) that govbot only references by path. ### Data Schemas @@ -102,12 +102,24 @@ Schemas can be referenced in YAML files using the `$schema` key: ```yaml # govbot.yml -$schema: https://raw.githubusercontent.com/windy-civi/toolkit/main/schemas/govbot.schema.json +$schema: https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json -repos: +datasets: - all -tags: - # ... tag definitions +transforms: + classify: + command: [fastclass, classify, "-"] + reads: docs + writes: classification + classifier: ./classifier # path to the fastclass classifier bundle +publish: + feed: + type: rss + base_url: "https://example.github.io/my-govbot" +pipelines: + default: + - classify + - feed ``` This enables: diff --git a/schemas/STREAM_PROTOCOL.md b/schemas/STREAM_PROTOCOL.md new file mode 100644 index 00000000..5f4290b8 --- /dev/null +++ b/schemas/STREAM_PROTOCOL.md @@ -0,0 +1,153 @@ +# govbot stack — frozen cross-domain contract + +**Status:** FROZEN for the layering refactor (build-sequence steps 1–6). +**Owner:** head architect. Subagents treat this as read-only input. A subagent that +finds the contract unworkable must escalate to the architect for a re-freeze — it does +not change the contract unilaterally. + +This is the load-bearing interface between the three layers — `fastclass` (classifier), +`govbot` (gov-data tool), and userland apps. The layers compose over **process +boundaries** (newline-delimited JSON on stdio), never as linked libraries. + +--- + +## 1. 
The input stream protocol (govbot → transform) + +`govbot` streams documents to a transform (e.g. `fastclass classify -`) as +**newline-delimited JSON** — one object per line, UTF-8, `\n`-terminated: + +```json +{"id": "", "text": "", "kind": "docs", "subjects": ["ENERGY", "ENVIRONMENT"]} +``` + +- **`id`** — an opaque routing key. The transform treats it as opaque and **echoes it + back unchanged** in the result. For govbot's `docs` projection it is the bill's + dataset path; no consumer parses its structure. +- **`text`** — the document body. For govbot's `docs` projection this is the **full + bill text** assembled from `metadata.json` (not just titles). +- **`kind`** — **required**. Tags the stream record type (`docs` today; future + `summary`, etc.). A transform that does not recognize a `kind` **passes the record + through untouched** rather than erroring. +- **`subjects`** — **optional**. When the source is an OCD-files bill whose + `metadata.json` carries a non-empty `subject:` array (e.g. `["ENERGY", + "ENVIRONMENT", "TAXATION"]`), govbot's `docs` projection surfaces those tags + here verbatim. These are gold-standard structured classifications assigned + by human OCD scrapers and are the canonical input a `concept_match`-style + matcher should consume rather than re-deriving topic signals from `text`. + The field is **omitted entirely** when the bill has no `subject:` key, when + the array is empty (`[]`), or when every element is blank — "no signal" + is unambiguous, so consumers never have to distinguish "absent" from + "explicitly empty". Bare log records (no bill metadata joined) also omit + it. Transforms that don't know about `subjects` ignore it; the stream + contract is additive. + +A transform reads this stream line-by-line and emits one result line per input line. + +## 2. The classify result (`ClassifyResult`) + +`fastclass classify` emits one `ClassifyResult` JSON object per input document. 
The +echoed identifier field is named **`doc`** (NOT `id`) — this is frozen; downstream +sinks (`govbot apply`) route on `doc`. Full shape — see +`fastclass/schemas/result.schema.json` for the machine-readable schema: + +```json +{ + "doc": "", + "text_hash": "sha256:", + "classifier_version": "sha256:<12-hex>", + "fusion_version": "fusion-v1", + "tags": { + "": { + "matched": true, + "threshold": 0.3, + "matcher_outputs": [ + {"kind": "keyword", "version": "...", "role": "scorer", + "raw_score": 1.0, "evidence": [{"kind": "keyword_hit", "detail": "solar"}]} + ], + "fusion": {"version": "fusion-v1", "final_score": 0.92, "gated": false} + } + } +} +``` + +- `matcher_outputs[].role` is one of `scorer` | `gate` | `penalty`. +- `tags` is ordered by tag name (byte-stable, snapshot-testable). + +## 3. `fastclass describe` + +`fastclass describe classifier=` emits a single JSON object so govbot can +type-check a transform DAG and validate that `publish.*.select:` tag names exist: + +```json +{ + "reads": ["docs"], + "writes": ["classification"], + "tags": ["clean_energy", "conservation", "emissions_and_climate", "fossil_fuels"], + "classifier_version": "sha256:<12-hex>", + "fusion_version": "fusion-v1", + "model": {"name": "sentence-transformers/all-MiniLM-L6-v2", "sha256_prefix": "<12-hex>"}, + "model_rerank": {"name": "cross-encoder/ms-marco-MiniLM-L-6-v2", "sha256_prefix": "<12-hex>"} +} +``` + +- `tags` is the sorted list of active tag names from the bundle. +- `describe` is a **subcommand** (not a `classify` flag). +- **`model`** — **optional**. Present iff an embedding model is installed at + `/model/` (the Tier-2 semantic matcher; installed via + `fastclass model fetch` or the `/fastclass:install-model` plugin command). + Shape is `{name?: string, sha256_prefix: string}`: `sha256_prefix` is the + first 12 hex chars of the model file's SHA-256; `name` is the + `KNOWN_MODELS` identifier (e.g. 
`sentence-transformers/all-MiniLM-L6-v2`) + when the prefix matches a vetted entry, and is **omitted** for a + user-staged custom model whose SHA isn't on the vetted list. The block is + **omitted entirely** for a lexical-only bundle — like `subjects` in §1, + this is additive: consumers that don't know about `model` ignore it, and + the lexical-only describe output is byte-identical to the pre-Tier-2 + contract. +- **`model_rerank`** — **optional**. Present iff a reranker model is + installed at `/model-rerank/` (sibling of `/model/`). + Same shape as `model` (`{name?: string, sha256_prefix: string}`) and + same `name` rule: set to the `KNOWN_MODELS` row when the SHA matches a + vetted entry, **omitted** for a user-staged reranker whose SHA isn't on + the vetted list. The block is **omitted entirely** for a bundle without + a reranker installed. Additive in the same way as `model`: consumers + that don't know about `model_rerank` ignore it, and a bundle with no + reranker produces describe output byte-identical to the pre-rerank + contract. + +## 4. The classifier-bundle layout + +A classifier bundle is a **directory**. `fastclass` owns its contents; `govbot` only +passes the path (`classifier=`). `fastclass` must NOT know the word "govbot" — +`govbot.yml` is **not** a recognized bundle file. + +``` +/ + classifier.yml taxonomy (REQUIRED; `fastclass.yml` is an accepted alias) + fusion.yml fusion weights + the cascade "uncertainty band" (optional) + eval/ + constitution.yml frozen gold set — never enters an LLM context + rolling.yml refreshable working eval set (optional) + proposals/ improvement-proposal history + model/ optional embedding model + model-rerank/ optional reranker model (sibling of model/) + fastclass.lock pins bundle + binary versions for lineage +``` + +## 5. 
Calibrated scores
+
+`fusion.final_score` is contractually a **calibrated** probability in `[0, 1]` —
+downstream consumers (publisher thresholds, summarizer gating) may threshold it
+directly. Calibration **regression** is *flagged* in the backtest verdict, **not
+blocked** (soft gate; hardening deferred).
+
+---
+
+## Layer ownership rule
+
+Each layer owns its own config; **no file is shared across a layer boundary.**
+
+- `classifier.yml` / `fusion.yml` / the bundle — fastclass's.
+- `govbot.yml` (the manifest: `datasets` / `transforms` / `publish` / `pipelines`) —
+  govbot's. It has **no `tags:`**.
+- A userland repo merely *contains* both — it owns neither tool's internals.
diff --git a/schemas/govbot.schema.json b/schemas/govbot.schema.json
index 555a2318..6b75b45b 100644
--- a/schemas/govbot.schema.json
+++ b/schemas/govbot.schema.json
@@ -1,61 +1,126 @@
 {
   "$schema": "http://json-schema.org/draft-07/schema#",
-  "title": "Govbot Configuration Schema",
-  "description": "Schema for validating govbot.yml configuration files",
+  "$id": "https://raw.githubusercontent.com/chihacknight/govbot/main/schemas/govbot.schema.json",
+  "title": "Govbot Manifest Schema",
+  "description": "Schema for validating govbot.yml manifest files. govbot.yml is a project manifest -- it declares the datasets a project consumes, the transforms it runs over them, the publishers that emit artifacts, and named pipelines that wire those stages together. It is NOT a classifier: the tag taxonomy lives in a separate fastclass classifier bundle (classifier.yml) which govbot only references by path.",
   "type": "object",
   "properties": {
-    "repos": {
-      "description": "List of repositories to clone and process. Use 'all' to include all available repositories.",
+    "$schema": {
+      "description": "Optional reference to this schema for editor autocomplete and validation.",
+      "type": "string"
+    },
+    "datasets": {
+      "description": "Government-data sources the project pulls and processes. Each entry is a dataset identifier -- a short code such as 'wy' today, or a 'namespace/name[@channel]' registry reference in a later wave. 'all' selects every dataset known to govbot. The schema is intentionally permissive (array of strings); structured registry references are validated by govbot at resolve time.",
       "type": "array",
       "items": {
-        "type": "string"
+        "type": "string",
+        "description": "A dataset identifier (e.g. 'wy', 'all', or a future 'namespace/name@channel' registry reference)."
       },
+      "minItems": 1,
       "default": ["all"]
     },
-    "tags": {
-      "description": "Tag definitions for categorizing legislation. Each tag should have a description and optional examples.",
+    "transforms": {
+      "description": "Named external-process transforms. A transform is a separate program that speaks the govbot stream protocol (newline-delimited JSON on stdio, stable 'id', typed 'kind'). govbot streams records of the transform's 'reads' kind into it and routes the records of its 'writes' kind back by 'id'. fastclass classification is one such transform; a future local-LLM summarizer is another.",
       "type": "object",
       "additionalProperties": {
-        "$ref": "#/definitions/tag"
+        "$ref": "#/definitions/transform"
       }
     },
     "publish": {
-      "description": "RSS feed publishing configuration",
+      "description": "Named publishers. A publisher consumes a result stream and emits one artifact: an RSS feed, an HTML index, a JSON dump, a DuckDB database, or Bluesky posts. Each publisher declares a 'type' plus type-specific configuration. To emit multiple artifact kinds (e.g. both an RSS feed and an HTML index), declare one publisher per kind.",
+      "type": "object",
+      "additionalProperties": {
+        "$ref": "#/definitions/publisher"
+      }
+    },
+    "pipelines": {
+      "description": "Named 'govbot run' targets, npm-script style. Each pipeline is an ordered list of stage references -- names of entries in 'transforms' and 'publish' -- executed in sequence. 'govbot run ' runs the named pipeline.",
+      "type": "object",
+      "additionalProperties": {
+        "$ref": "#/definitions/pipeline"
+      }
+    }
+  },
+  "required": ["datasets"],
+  "additionalProperties": false,
+  "definitions": {
+    "transform": {
       "type": "object",
+      "description": "A single external-process transform stage.",
       "properties": {
-        "base_url": {
-          "description": "Base URL for RSS feed links (required for GitHub Pages). Should match your GitHub Pages URL.",
+        "command": {
+          "description": "The external process to run, given the govbot stream on stdin and emitting results on stdout. Either a single string (parsed as a shell-style command) or an argv array (first element is the executable, the rest are arguments).",
+          "oneOf": [
+            {
+              "type": "string"
+            },
+            {
+              "type": "array",
+              "items": {
+                "type": "string"
+              },
+              "minItems": 1
+            }
+          ]
+        },
+        "reads": {
+          "description": "The stream record kind this transform consumes. govbot only feeds records of this kind into the transform; records of other kinds pass through untouched. 'docs' is the document-projection kind defined by the stream protocol.",
           "type": "string",
-          "format": "uri",
-          "default": "https://example.com"
+          "examples": ["docs", "classification", "summary"]
         },
-        "output_dir": {
-          "description": "Directory where RSS feeds are generated",
+        "writes": {
+          "description": "The stream record kind this transform produces. govbot routes records of this kind back into the dataset (or onward to the next stage) by their 'id'. The classify transform writes 'classification'.",
           "type": "string",
-          "default": "feeds"
+          "examples": ["classification", "summary"]
         },
-        "output_file": {
-          "description": "Output filename for the RSS feed",
+        "classifier": {
+          "description": "For a classify-style transform: the path to the fastclass classifier bundle directory (containing classifier.yml). govbot passes this path to the transform unchanged and never reads the bundle's contents itself.",
+          "type": "string"
+        }
+      },
+      "required": ["command", "reads", "writes"],
+      "additionalProperties": true
+    },
+    "publisher": {
+      "type": "object",
+      "description": "A single publisher stage. The required fields depend on 'type'. Each publisher kind emits exactly ONE artifact: 'rss' writes the RSS feed (default feed.xml), 'html' writes the HTML index (default index.html), 'json' writes a JSON dump, 'duckdb' loads records into a DuckDB database, 'bluesky' posts to a Bluesky account. To get both a feed and an HTML index, declare both an 'rss' and an 'html' publisher.",
+      "properties": {
+        "type": {
+          "description": "The publisher kind. 'rss' writes the RSS feed only; 'html' writes the HTML index only; 'json' emits a JSON dump; 'duckdb' loads results into a DuckDB database; 'bluesky' posts to a Bluesky account.",
           "type": "string",
-          "default": "feed.xml"
+          "enum": ["rss", "html", "json", "duckdb", "bluesky"]
         },
-        "tags": {
-          "description": "Specific tags to include in the combined RSS feed. If not specified, all tags are included.",
+        "select": {
+          "description": "Tag names to include. Only records carrying at least one of these tags are published; if omitted, all tagged records are published. Tag names must exist in the classifier bundle (govbot validates them against 'fastclass describe').",
           "type": "array",
           "items": {
             "type": "string"
           }
         },
+        "base_url": {
+          "description": "Base URL for generated links. Required for rss/html (e.g. the GitHub Pages URL). For bluesky, '{link}' defaults to the manifest's companion 'html' publisher's base_url (the human-readable landing page activists click through to); when no 'html' publisher is configured, '{link}' falls back to this 'base_url' joined to the bill's dataset path, and then to 'bill.sources[0].url'.",
+          "type": "string",
+          "format": "uri"
+        },
+        "output_dir": {
+          "description": "Directory where the publisher writes its artifacts (used by rss/html/json).",
+          "type": "string",
+          "default": "docs"
+        },
+        "output_file": {
+          "description": "Output filename for the publisher's single artifact. Defaults by 'type': 'rss' -> 'feed.xml', 'html' -> 'index.html', 'json' -> 'feed.json', 'duckdb' -> 'feed.duckdb'.",
+          "type": "string"
+        },
         "title": {
-          "description": "Custom feed title. If not specified, defaults to combined tag names.",
+          "description": "Custom feed/index title. If omitted, defaults to a title derived from the selected tag names.",
           "type": "string"
         },
         "description": {
-          "description": "Custom feed description. If not specified, defaults to combined tag descriptions.",
+          "description": "Custom feed/index description. If omitted, defaults to a description derived from the selected tags.",
           "type": "string"
         },
         "limit": {
-          "description": "Limit number of entries per RSS feed. Use 'none' for no limit, or a number. Default is 15 (RSS standard).",
+          "description": "Maximum number of entries to include. Use the string 'none' for no limit, or a positive integer.",
           "oneOf": [
             {
               "type": "string",
@@ -66,30 +131,48 @@
               "minimum": 1
             }
           ]
-        }
-      },
-      "required": ["base_url", "output_dir", "output_file"]
-    }
-  },
-  "required": ["repos", "tags"],
-  "definitions": {
-    "tag": {
-      "type": "object",
-      "properties": {
-        "description": {
-          "description": "Detailed description of what legislation this tag covers. Use YAML multiline strings (|) for formatting.",
+        },
+        "min_score": {
+          "description": "Bluesky publisher only. The minimum calibrated 'final_score' a matched tag must reach for a record to be posted. 'final_score' is the calibrated probability in [0,1] guaranteed by the stream protocol, so it can be thresholded directly. Defaults to 0.6 -- a conservative value so a misconfigured manifest does not flood a feed with low-confidence matches.",
+          "type": "number",
+          "minimum": 0,
+          "maximum": 1,
+          "default": 0.6
+        },
+        "ledger": {
+          "description": "Bluesky publisher only. Path to the append-only posted-state ledger that makes the publisher idempotent -- it records the id of every record posted so re-runs never double-post. Relative paths resolve against the project directory. Defaults to 'state/bluesky-.ledger' (peer to 'tags/' and 'dist/'); NOT under '.govbot/', which is the tool's regenerable cache. If a ledger file exists at the legacy '.govbot/bluesky-.ledger' location from a pre-fix install, it is read as a fallback so post history survives the upgrade; writes always land at the new path.",
           "type": "string"
         },
-        "examples": {
-          "description": "Example bill descriptions that would match this tag",
-          "type": "array",
-          "items": {
-            "type": "string"
+        "post_template": {
+          "description": "Bluesky publisher only. The post-text template. Supported placeholders, substituted per record: '{title}', '{tags}', '{link}', '{identifier}', '{session}', '{score}'. Rendered text is truncated to Bluesky's 300-character limit. If omitted, a sensible default template is used. Bluesky credentials are NEVER schema fields -- they are read from the environment (BLUESKY_HANDLE, BLUESKY_APP_PASSWORD, optional BLUESKY_SERVICE).",
+          "type": "string"
+        }
+      },
+      "required": ["type"],
+      "additionalProperties": true,
+      "allOf": [
+        {
+          "if": {
+            "properties": {
+              "type": {
+                "enum": ["rss", "html"]
+              }
+            }
+          },
+          "then": {
+            "required": ["base_url"]
           }
         }
+      ]
+    },
+    "pipeline": {
+      "type": "array",
+      "description": "An ordered list of stage references. Each item names an entry in the manifest's 'transforms' or 'publish' map; govbot runs them in order.",
+      "items": {
+        "type": "string",
+        "description": "The name of a transform or publisher defined elsewhere in this manifest."
       },
-      "required": ["description"]
+      "minItems": 1
     }
   }
 }
-
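
Reviewer note: the schema types pipeline items as plain strings, so it cannot by itself confirm that a stage name resolves to an entry in `transforms` or `publish`. The sketch below shows what such a cross-reference check could look like, alongside the structural rules the schema does state (`datasets` non-empty, `base_url` required for `rss`/`html`, `min_score` in `[0, 1]`). It is an illustration only, not govbot's implementation: `check_manifest`, the sample manifest, the `./run-classifier` command, and the `classifier-bundle` path are all hypothetical.

```python
from typing import Any

# Hypothetical manifest, written as the dict a YAML loader might produce from
# govbot.yml. Every name and value here is an illustrative assumption.
MANIFEST: dict[str, Any] = {
    "datasets": ["wy"],
    "transforms": {
        "classify": {
            "command": ["./run-classifier"],  # hypothetical executable
            "reads": "docs",
            "writes": "classification",
            "classifier": "classifier-bundle",  # hypothetical bundle path
        }
    },
    "publish": {
        "feed": {"type": "rss", "base_url": "https://example.org/bot"},
        "index": {"type": "html", "base_url": "https://example.org/bot"},
        "bsky": {"type": "bluesky", "min_score": 0.6},
    },
    "pipelines": {"daily": ["classify", "feed", "index", "bsky"]},
}

PUBLISHER_TYPES = {"rss", "html", "json", "duckdb", "bluesky"}


def check_manifest(manifest: dict[str, Any]) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems: list[str] = []

    # 'datasets' is the only required top-level key and needs >= 1 entry.
    datasets = manifest.get("datasets")
    if not isinstance(datasets, list) or not datasets:
        problems.append("datasets: required, with at least one entry")

    transforms = manifest.get("transforms", {})
    for name, transform in transforms.items():
        for field in ("command", "reads", "writes"):
            if field not in transform:
                problems.append(f"transforms.{name}: missing '{field}'")

    publish = manifest.get("publish", {})
    for name, publisher in publish.items():
        kind = publisher.get("type")
        if kind not in PUBLISHER_TYPES:
            problems.append(f"publish.{name}: unknown type {kind!r}")
        # The schema's if/then clause: rss and html publishers need base_url.
        if kind in ("rss", "html") and "base_url" not in publisher:
            problems.append(f"publish.{name}: '{kind}' requires base_url")
        score = publisher.get("min_score", 0.6)
        if not isinstance(score, (int, float)) or not 0 <= score <= 1:
            problems.append(f"publish.{name}: min_score must be in [0, 1]")

    # Cross-reference check the schema alone does not express.
    stages = set(transforms) | set(publish)
    for name, refs in manifest.get("pipelines", {}).items():
        if not refs:
            problems.append(f"pipelines.{name}: needs at least one stage")
        for ref in refs:
            if ref not in stages:
                problems.append(f"pipelines.{name}: unknown stage '{ref}'")

    return problems


if __name__ == "__main__":
    print(check_manifest(MANIFEST))
```

Returning a list of problems instead of raising on the first one lets a single run report every broken stage reference at once, which may be friendlier for a manifest edited by hand.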