docs: document the sync orchestrator as the single content entry point

JuanLara18 · JuanLara18 · commit dfaffaebfe5d · 2026-04-11T10:39:53.000-05:00
diff --git a/README.md b/README.md
@@ -52,24 +52,25 @@ npm start
 
 The blog manifest (`blogData.json`) is generated automatically before each build via `prebuild`.
 
-## Adding a post
+## Adding or updating a post
 
-Create a `.md` file in `front/public/blog/posts/<category>/` with YAML frontmatter, then commit and push — GitHub Actions handles the rest.
+1. Create or edit `front/public/blog/posts/<category>/<slug>.md` with YAML frontmatter.
+2. Run the content pipeline — one command handles orphan cleanup, Mermaid validation, image optimization, EN+ES audio generation, and blog data rebuild:
+   ```bash
+   cd front
+   npm run sync              # full pipeline
+   npm run sync:fast         # skip Spanish audio (quick text iteration)
+   ```
+3. Commit everything and push. GitHub Actions runs `sync:check` before each deploy to catch inconsistencies early.
 
-To also generate narrated audio (EN + ES) for the new post, run the one-shot
-wrapper (requires Python + Ollama for Spanish — see scripts README):
-
-```bash
-./front/scripts/generate_audio.sh          # bash / WSL / git-bash
-.\front\scripts\generate_audio.ps1         # Windows PowerShell
-```
+The audio pipeline requires Python 3.10+. Ollama is optional: if it is not installed, Spanish audio is skipped with a warning.
 
 ## Tooling
 
-All build-time scripts (blog data generation, image optimization, Mermaid
-validation, PDF export, audio narration pipeline) are documented in
-[`front/scripts/README.md`](front/scripts/README.md). For a dev-oriented
-walkthrough of the React app, see [`front/README.md`](front/README.md).
+All build-time scripts — content orchestrator (`sync.py`), blog data generation,
+image optimization, Mermaid validation, PDF export, audio narration pipeline —
+are documented in [`front/scripts/README.md`](front/scripts/README.md). For a
+dev-oriented walkthrough of the React app, see [`front/README.md`](front/README.md).
 
 ## License
 
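The optional-Ollama behavior described in the README hunk above (Spanish audio is skipped with a warning when Ollama is missing) could be implemented with a plain `PATH` lookup. The sketch below is an assumption about how such a check might look, not the code in `sync.py`; the function name `spanish_audio_enabled` and the warning text are hypothetical.

```python
import shutil
import sys


def spanish_audio_enabled(which=shutil.which) -> bool:
    """Return True if the `ollama` binary is on PATH, else warn and return False.

    `which` is injectable so the check can be exercised without Ollama installed.
    """
    if which("ollama") is None:
        # Hypothetical warning text; the real message may differ.
        print("warning: ollama not found on PATH; skipping Spanish audio",
              file=sys.stderr)
        return False
    return True
```

A caller would run the English step unconditionally and guard only the Spanish step with this check, so a missing Ollama never turns into a non-zero exit.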
diff --git a/front/README.md b/front/README.md
@@ -34,31 +34,55 @@ front/
 |---|---|
 | `npm start` | CRA dev server |
 | `npm run build` | Production build. `prebuild` regenerates `src/data/blogData.json` |
+| `npm run sync` | **Main content pipeline.** Clean orphan audio, validate Mermaid, optimize images, generate EN + ES audio, rebuild blog data |
+| `npm run sync:fast` | Same as `sync` but skips Spanish audio (fast text iteration) |
+| `npm run sync:check` | Validate only; no side effects. Runs in CI before every deploy |
 | `npm run build-blog-data` | Rebuild the blog manifest only |
-| `npm run optimize-images` | Run image optimization (WebP + resized variants) |
+| `npm run optimize-images` | Image optimization (WebP + resized variants) |
 | `npm run validate-mermaid` | Lint Mermaid fences across all posts |
 | `npm run generate-pdf` | Render all posts into `output/blog-compilation.pdf` |
-| `npm run generate-audio` | Regenerate English audio narration |
 
-For the full tooling reference — including the Python audio pipeline, Ollama
-setup for Spanish narration, and the one-shot `generate_audio.sh` /
-`generate_audio.ps1` wrappers — see [`scripts/README.md`](scripts/README.md).
+For the full tooling reference — the Python audio pipeline, Ollama setup for
+Spanish narration, edge cases, and troubleshooting — see
+[`scripts/README.md`](scripts/README.md).
 
-## Adding a new blog post
+## Adding or updating a blog post
 
-1. Create `public/blog/posts/<category>/<slug>.md` with YAML frontmatter
-   (title, date, category, tags, excerpt, readingTime, …).
-2. (Optional) Add audio narration:
-   ```bash
-   ./scripts/generate_audio.sh          # bash / WSL / git-bash
-   .\scripts\generate_audio.ps1         # Windows PowerShell
-   ```
-3. Commit the markdown, any images, and the generated MP3 + sidecar JSON
-   files. GitHub Actions deploys on push to `main`.
+```bash
+# 1. Edit the markdown
+vim public/blog/posts/<category>/<slug>.md
+
+# 2. Sync (incremental; auto-detects what needs regenerating)
+npm run sync            # full: EN + ES audio; Spanish takes roughly minutes per new post
+# or:
+npm run sync:fast       # skip ES if you're just iterating on text
+
+# 3. Commit everything that changed
+git add -A && git commit -m "post: <title>"
+```
+
+`sync` handles the edge cases so you don't have to:
+
+- **Rename a post** (change the `.md` filename): the old audio is detected as
+  an orphan and deleted automatically.
+- **Move between categories**: same as a rename; orphan cleanup catches it.
+- **Edit only code blocks, diagrams, math, or frontmatter**: the audio cache
+  stays valid; no regeneration, no waiting.
+- **Edit prose**: only the affected post's audio regenerates.
+- **Ollama not installed**: Spanish audio is skipped with a warning; English
+  still works.
 
 ## Deployment
 
-`.github/workflows/deploy.yml` builds on every push touching `front/**`,
-runs `build-blog-data.js` via `prebuild`, conditionally optimizes new images,
-and deploys the build to the `gh-pages` branch. Audio is **not** regenerated
-in CI — MP3s are committed to the repo.
+`.github/workflows/deploy.yml` runs on every push touching `front/**`:
+
+1. Set up Node 18 and Python 3.11
+2. `npm ci`
+3. **`npm run sync:check`**: fails fast on orphans or Mermaid errors before
+   spending minutes on a doomed build
+4. Optimize new images (only if unoptimized files are detected)
+5. `npm run build` (which runs `prebuild → build-blog-data.js`)
+6. Deploy `build/` to the `gh-pages` branch
+
+Audio is **not** regenerated in CI — MP3s are committed to the repo. The local
+`npm run sync` is what produces them.
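The orphan handling described in the hunk above (delete stale audio locally, fail in `sync:check`) can be sketched as follows. This is a minimal illustration under assumptions, not the actual `sync.py`: the `<audio_dir>/<category>/<slug>.mp3` layout and the names `discover_posts`, `find_orphans`, and `clean_orphans` are hypothetical.

```python
from pathlib import Path


def discover_posts(posts_dir: Path) -> set[tuple[str, str]]:
    """Canonical (category, slug) pairs for posts at <posts_dir>/<category>/<slug>.md."""
    return {(md.parent.name, md.stem) for md in posts_dir.glob("*/*.md")}


def find_orphans(audio_dir: Path, posts: set[tuple[str, str]]) -> list[Path]:
    """Audio files whose (category, slug) has no matching post (assumed layout)."""
    return sorted(
        mp3 for mp3 in audio_dir.glob("*/*.mp3")
        if (mp3.parent.name, mp3.stem) not in posts
    )


def clean_orphans(audio_dir: Path, posts: set[tuple[str, str]], check: bool = False) -> int:
    """Return an exit code: in check mode, 1 if orphans exist; otherwise delete them and return 0."""
    orphans = find_orphans(audio_dir, posts)
    if check:
        return 1 if orphans else 0  # report only, no side effects
    for path in orphans:
        path.unlink()
    return 0
```

Keying on the `(category, slug)` pair rather than the slug alone is what lets a category move register as an orphan even though the filename is unchanged.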
diff --git a/front/scripts/README.md b/front/scripts/README.md
@@ -1,40 +1,96 @@
 # Build & Tooling Scripts
 
-Build-time utilities for the portfolio: blog data generation, image optimization,
-diagram validation, PDF export, and narrated audio generation. Most are wired into
-npm scripts in `front/package.json`; a few are invoked directly.
+Build-time utilities for the portfolio: content consistency, blog data
+generation, image optimization, diagram validation, PDF export, and
+narrated audio generation.
+
+For day-to-day work you only need one command: **`npm run sync`**. It
+orchestrates every step below in the right order with hash-aware caching,
+so re-runs are cheap and safe. The individual scripts are still callable
+for troubleshooting or surgical edits.
 
 ---
 
 ## Quick reference
 
 | Script | npm alias | Purpose |
 |---|---|---|
-| `build-blog-data.js` | `build-blog-data` (auto via `prebuild`) | Scan `public/blog/posts/` and emit `src/data/blogData.json` with metadata + audio manifest refs |
-| `optimize-images.js` | `optimize-images` | Produce WebP and size-capped JPEG/PNG variants under `public/images/` and `public/blog/` |
-| `validate-mermaid.js` | `validate-mermaid` | Lint Mermaid fenced blocks in every post against the renderer's v11 normalization |
-| `generate-blog-pdf.js` | `generate-pdf` | Compile all posts into a single styled PDF (`output/blog-compilation.pdf`) |
-| `generate_blog_audio.py` | `generate-audio` (EN only) | Render narrated MP3s per post in English or Spanish |
-| `generate_audio.sh` / `generate_audio.ps1` | — | One-shot wrapper: ensure Ollama is running, then run EN + ES |
-| `translate_ollama.py` | — | Ollama client used by the ES audio path (not invoked directly) |
-| `md_to_speech.py` | — | Markdown → narration-ready text preprocessor (imported by `generate_blog_audio.py`) |
+| `sync.py` | `sync` / `sync:fast` / `sync:check` | **Main entry point.** Clean orphans → validate Mermaid → optimize images → generate EN+ES audio → rebuild blog data |
+| `build-blog-data.js` | (auto via `prebuild`) | Scan `public/blog/posts/` and emit `src/data/blogData.json` |
+| `optimize-images.js` | `optimize-images` | WebP + size-capped variants (idempotent) |
+| `validate-mermaid.js` | `validate-mermaid` | Lint Mermaid fences against the renderer's v11 normalization |
+| `generate-blog-pdf.js` | `generate-pdf` | Compile all posts into `output/blog-compilation.pdf` |
+| `generate_blog_audio.py` | — | Render narrated MP3s per post (EN or ES). Normally called via `sync` |
+| `translate_ollama.py` | — | Ollama client used by the ES audio path |
+| `md_to_speech.py` | — | Markdown → narration-ready text preprocessor |
 
 ---
 
-## Blog data generation
+## The `sync` orchestrator
 
-`build-blog-data.js` runs automatically before every `npm run build` (via the
-`prebuild` hook) and whenever you want a fresh manifest in dev:
+```bash
+npm run sync                  # full pipeline
+npm run sync:fast             # skip Spanish audio (quick iteration on text)
+npm run sync:check            # validate only; no side effects (used in CI)
+
+# Pass-through flags via -- :
+npm run sync -- --only <slug>
+npm run sync -- --force
+npm run sync -- --dry-run
+```
+
+### What each step does
+
+| # | Step | Notes |
+|---|---|---|
+| 1 | Discover posts | Scans `public/blog/posts/**/*.md` → canonical `(category, slug)` set |
+| 2 | Clean orphan audio | Deletes MP3/JSON/narration.json whose post no longer exists (rename or category move). In `--check` mode, fails instead of deleting |
+| 3 | Validate Mermaid | Fail-fast before any expensive work; errors block, warnings are advisory |
+| 4 | Optimize images | Idempotent; warnings don't block |
+| 5 | English audio | `generate_blog_audio.py --lang en` (hash cache) |
+| 6 | Spanish audio | Auto-starts `ollama serve` if needed; if Ollama isn't installed, **warns and skips** instead of failing. Hash cache applies |
+| 7 | Rebuild blog data | Writes `src/data/blogData.json` so local `npm start` sees the fresh state |
+
+Steps 1–3 run in `--check`. All seven run in the full pipeline.
+
+### Exit codes
+
+- `0` — success (or intentional skips)
+- `1` — inconsistency detected (`--check`) or a required step failed
 
+### Typical workflows
+
+**New post, full treatment:**
 ```bash
-npm run build-blog-data
+# edit public/blog/posts/<category>/<slug>.md
+npm run sync
+git add -A && git commit -m "post: <title>"
 ```
 
-It parses YAML front-matter from each `.md` under `public/blog/posts/<category>/`,
-merges in audio manifest data from `public/blog/audio/manifest.json` and
-`public/blog/audio-es/manifest-es.json` when present, and writes
-`src/data/blogData.json`. The React app reads only this JSON — it never touches
-raw markdown at runtime.
+**Iterating on text (skip slow ES translation):**
+```bash
+npm run sync:fast
+```
+
+**Renaming or moving a post:** rename or move the `.md` file and run `npm run sync`.
+Step 2 detects the old audio as orphaned and deletes it; steps 5 and 6 regenerate
+it under the new name. No manual cleanup is needed.
+
+**Surgical regeneration of a single post:**
+```bash
+npm run sync -- --only <slug> --force
+```
+
+---
+
+## Blog data generation
+
+`build-blog-data.js` runs automatically before every `npm run build` (via the
+`prebuild` hook), and is also step 7 of `sync`. It parses YAML front-matter from
+each `.md` under `public/blog/posts/<category>/`, merges in audio manifest data
+from `public/blog/audio/manifest.json` and `public/blog/audio-es/manifest-es.json`,
+and writes `src/data/blogData.json`. The React app reads only this JSON — it
+never touches raw markdown at runtime.
 
 ---
 
@@ -54,42 +110,10 @@ loads them via the generated manifests.
 - **Ollama** installed and on `PATH` — required **only** for Spanish, which
   translates the English narration with a local LLM. Default model:
   `gemma4:latest`. Install models with `ollama pull gemma4:latest`.
+  If Ollama is not present, `npm run sync` skips Spanish audio with a warning
+  instead of failing.
 - **ffmpeg** is **not** required — `edge-tts` emits MP3 directly.
 
-### One-shot generation (recommended)
-
-Use the wrapper script. It pings Ollama, launches `ollama serve` in the
-background if needed, waits up to 30 s for readiness, then runs both
-languages in sequence.
-
-```bash
-# Bash / WSL / macOS / git-bash
-./front/scripts/generate_audio.sh
-
-# Windows PowerShell
-.\front\scripts\generate_audio.ps1
-```
-
-Both accept pass-through flags that forward to `generate_blog_audio.py`:
-
-```bash
-./front/scripts/generate_audio.sh --only attention-is-all-you-need
-./front/scripts/generate_audio.sh --force
-./front/scripts/generate_audio.sh --limit 5
-```
-
-### Direct invocation
-
-If you only need one language or want finer control:
-
-```bash
-cd front
-python -u scripts/generate_blog_audio.py --lang en
-python -u scripts/generate_blog_audio.py --lang es --translate-model gemma4:latest
-```
-
-Flags: `--only <slug>`, `--force`, `--limit N`, `--dry-run`, `--verbose`.
-
 ### How it works
 
 Per post, `generate_blog_audio.py`:
@@ -107,7 +131,8 @@ Per post, `generate_blog_audio.py`:
 5. Rewrites the per-language manifest so `build-blog-data.js` can merge it.
 
 The cache is content-addressable: if neither the narration source nor the voice
-changed, the post is skipped. Safe to re-run as often as you like.
+changed, the post is skipped. Edits to code blocks, diagrams, math, or front-matter
+do **not** invalidate the audio cache — only changes to narratable prose do.
 
 ### Output layout
 
@@ -122,42 +147,40 @@ front/public/blog/
 ```
 
 All of the above is committed to the repo — audio is **not** regenerated on
-deploy (see commit `c743647`).
+deploy. CI only validates consistency via `sync:check`.
 
 ### Troubleshooting
 
-**ES generation hangs forever.** Historically this happened when `ollama serve`
-wasn't running and Python sat in TCP `SYN_SENT` retries. The current client
+**ES generation hangs forever.** Historical bug: when `ollama serve` wasn't
+running, Python sat in TCP `SYN_SENT` retries. The current client
 (`translate_ollama.py`) does a fast socket pre-check, retries with exponential
 backoff, and respects `OLLAMA_CALL_TIMEOUT` / `OLLAMA_MAX_RETRIES` env vars.
-If it still stalls, verify `curl -sf http://localhost:11434/api/tags` responds.
+Additionally, `sync.py` auto-starts `ollama serve` if it's not already running.
 
-**"Cannot reach Ollama" after several retries.** Either Ollama isn't installed,
-the binary isn't on `PATH`, or the model you requested isn't pulled. Run
-`ollama list` and `ollama pull gemma4:latest`.
+**"Ollama did not become ready in 30s."** Check that `curl -sf http://localhost:11434/api/tags`
+responds, then run `ollama list` to confirm the requested model is pulled; if it is
+missing, run `ollama pull gemma4:latest`.
 
-**Long initial run.** Full regeneration of ~70 posts in Spanish takes **hours**
+**Long initial run.** Full Spanish regeneration of ~70 posts takes **hours**
 on CPU-only or modest GPUs (≈80 s per 3.5k-char chunk on an 8B model × ~5 chunks
-per post). The job is resumable — the hash cache means interrupting and restarting
-skips everything already done.
+per post). The job is resumable: thanks to the hash cache, a restarted run skips
+everything already done.
 
-**PowerShell execution policy blocks the wrapper.** Run it once with
-`powershell -ExecutionPolicy Bypass -File .\front\scripts\generate_audio.ps1`
-or set the policy permanently for your user.
+**I only want to iterate on text and skip ES.** Use `npm run sync:fast`.
 
 ---
 
 ## Image optimization
 
 ```bash
-npm run optimize-images                 # all directories
+npm run optimize-images                 # all directories (invoked by sync step 4)
 node scripts/optimize-images.js --blog  # blog images only
 ```
 
 Creates WebP versions and resized JPEG/PNG variants in place. Idempotent —
-re-runs skip files that already have a `-optimized` counterpart. Originals are
-preserved with a `-original` suffix and git-ignored (see `front/.gitignore`).
-CI runs this only when new unoptimized images are detected.
+re-runs skip files that already have an optimized counterpart. Originals are preserved
+with a `-original` suffix and git-ignored (see `front/.gitignore`). CI runs
+this only when new unoptimized images are detected.
 
 ---
 
@@ -169,7 +192,11 @@ npm run validate-mermaid
 
 Parses every fenced \`\`\`mermaid block in `public/blog/posts/**`, applies the
 same normalization `PostRenderer` uses at runtime, and flags patterns that
-Mermaid v11 rejects. Use before pushing a post that contains diagrams.
+Mermaid v11 rejects.
+
+- **Errors** (exit 1): block deploy. Currently none emitted.
+- **Warnings** (exit 0): advisory; review when possible.
+- **Info**: stylistic notes.
 
 > **Gotcha:** diagram fences must open with \`\`\`mermaid — never
 > \`\`\`flowchart or \`\`\`timeline. The renderer keys off the fence language.
@@ -183,6 +210,6 @@ npm run generate-pdf
 ```
 
 Renders every post into a single styled PDF at `output/blog-compilation.pdf`
-using PDFKit. Cover page, TOC, per-category chapters, inline KaTeX via
-MathJax-rendered SVG, and syntax-highlighted code blocks. Not part of the
-deploy pipeline — run on demand when you want a printable snapshot.
+using PDFKit. The output includes a cover page, TOC, per-category chapters, KaTeX via
+MathJax-rendered SVG, and syntax-highlighted code blocks. It is not part of the deploy
+pipeline; run it on demand when you want a printable snapshot.
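The audio cache rule stated in the `front/scripts/README.md` hunks above (edits to code blocks, diagrams, math, or front-matter keep the cache valid; prose or voice changes do not) amounts to hashing only the narratable text together with the voice. The sketch below illustrates that idea under assumptions: the stripping rules are much simpler than whatever `md_to_speech.py` actually does, and `narratable_text` and `cache_key` are hypothetical names.

```python
import hashlib
import re

FENCE = "`" * 3  # built programmatically to avoid a literal fence inside this example

FRONT_MATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
FENCED_BLOCK = re.compile(rf"^{FENCE}.*?^{FENCE}[ \t]*$", re.DOTALL | re.MULTILINE)
DISPLAY_MATH = re.compile(r"\$\$.*?\$\$", re.DOTALL)


def narratable_text(markdown: str) -> str:
    """Drop front-matter, fenced blocks (code, Mermaid), and display math; normalize whitespace."""
    text = FRONT_MATTER.sub("", markdown)
    text = FENCED_BLOCK.sub("", text)
    text = DISPLAY_MATH.sub("", text)
    return " ".join(text.split())


def cache_key(markdown: str, voice: str) -> str:
    """SHA-256 over the voice plus narratable prose; unchanged key means skip regeneration."""
    payload = f"{voice}\n{narratable_text(markdown)}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With a key like this stored per post in the manifest, a re-run compares keys and skips any post whose key is unchanged, which is also what makes an interrupted run resumable.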