
Discussion: Notebook format and LLM-friendliness — should we change how docs are stored in git? #844

@daimon-pymclabs


This is a discussion topic, not a bug or feature request. (The GitHub token can't create Discussions, so using an issue.)

Motivation

A recent analysis of repo token counts across causal inference libraries showed CausalPy at ~10.7M tokens — far more than Google's CausalImpact (~52K) or Meta's GeoLift (~166K). But the comparison is misleading for two reasons:

  1. CausalPy covers 10+ quasi-experimental methods (DiD, staggered DiD, synthetic control, RD, RK, ITS, IV, IPW, ANCOVA) with both Bayesian and OLS variants. CausalImpact and GeoLift each do one thing. On a per-method basis, the content is comparable.

  2. ~90% of the raw token count is base64-encoded PNG images embedded in executed .ipynb files. The actual meaningful content (source code, notebook source cells, markdown docs) is ~848K tokens. The inflated number is an artifact of how notebooks store outputs, not a sign of bloat.
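
The ~90% figure is easy to sanity-check yourself. Here is a hedged sketch (not the script used for the original analysis) that measures what fraction of a notebook's bytes are base64 image payloads, using only the stdlib:

```python
import json

def image_byte_share(nb_text: str) -> float:
    """Fraction of a v4 notebook's bytes taken up by base64 image outputs."""
    nb = json.loads(nb_text)
    image_bytes = 0
    for cell in nb.get("cells", []):
        for out in cell.get("outputs", []):
            for mime, payload in out.get("data", {}).items():
                if mime.startswith("image/"):
                    # Payloads may be stored as a list of strings.
                    if isinstance(payload, list):
                        payload = "".join(payload)
                    image_bytes += len(payload)
    return image_bytes / max(len(nb_text), 1)
```

Run over the committed notebooks, this kind of measurement is what separates the ~10.7M raw figure from the ~848K of meaningful content.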

That said, this raises a real question: as AI tools (coding agents, RAG systems, embedding pipelines, repo indexers) become a bigger part of how people discover, learn, and contribute to libraries, should we rethink how documentation is stored in the repo?

The current state

  • 35 executed .ipynb notebooks in docs/source/notebooks/
  • Outputs (plots, tables, MCMC traces) are committed with the notebooks
  • The docs site (ReadTheDocs) renders from these committed notebooks
  • Git diffs on notebooks are effectively unreadable due to embedded base64 output data

Options

Option 1: Do nothing

Modern coding agents (Claude Code, Cursor, Copilot) don't dump the entire repo into context — they search and read files on demand. Claude Code has a native .ipynb reader that skips base64 data. For the primary use case of agents helping developers work on CausalPy, the current format works fine.

Pros: No migration effort. No CI changes. No risk of breaking the docs build.

Cons: RAG/embedding tools that index the full repo will still ingest base64 blobs. Git diffs on notebooks remain unreadable. Clone size stays large. Naive token counters will continue to produce misleading comparisons.

Option 2: nbstripout (strip outputs before commit)

Add nbstripout as a git filter via .gitattributes. Outputs are automatically stripped when staging notebooks, so only source cells are committed. The docs site still renders with outputs by executing notebooks during the build.
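
For intuition, what the filter does amounts to the following (a minimal sketch, not nbstripout's actual implementation, which also handles cell and notebook metadata):

```python
import json

def strip_outputs(nb_text: str) -> str:
    """Remove outputs and execution counts from a v4 notebook JSON string.

    Sketch of an nbstripout-style clean filter: only source cells survive.
    """
    nb = json.loads(nb_text)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []          # drop plots, tables, traces
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)
```

The real tool is wired in with `nbstripout --install`, which registers the filter in git config and adds `*.ipynb filter=nbstripout` to `.gitattributes`.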

Pros: Minimal migration — one config change, one bulk strip. Repo drops from ~10.7M to ~1.1M naive tokens. Git diffs become reviewable. Clone size shrinks dramatically.

Cons: Notebooks on GitHub's web UI will render without outputs (no plots visible when browsing on github.com). CI must execute all notebooks on docs builds, which means running PyMC sampling for 35 notebooks — see CI execution cost section below.

Option 2b: nbstripout on feature branches only

A variant of Option 2: keep outputs committed on main (re-generated by a merge-triggered or scheduled CI job using make run_notebooks_full per #672), but use nbstripout on feature branches so PRs have clean diffs.

Pros: main always has outputs for GitHub browsing and docs. PR diffs are reviewable. No persistent cache infrastructure needed — the CI job on merge writes outputs directly back to the repo.

Cons: More workflow complexity (git filter behaviour differs by branch context). The merge-to-main execution job still has to run all 35 notebooks periodically. Merge conflicts need care, since PR branches carry stripped notebooks while main carries outputs.

Option 3: Convert to MyST Markdown via jupytext

Replace .ipynb files with .md files using MyST Markdown format, paired with jupytext for execution.
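
For illustration, a jupytext-paired MyST notebook is just a text file; the structure below is the standard format (the cell contents are hypothetical):

````markdown
---
jupytext:
  formats: md:myst
kernelspec:
  display_name: Python 3
  name: python3
---

# Difference in differences

Narrative markdown between cells diffs like any other text.

```{code-cell} ipython3
import causalpy as cp  # hypothetical example cell
```
````

Everything is plain text, so `git blame`, review comments, and grep all work on notebook content.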

Pros: Markdown is natively LLM-friendly — no JSON wrapper, no embedded binary. Files are cleanly diffable. Standard git workflows (blame, merge, review) work naturally. Outputs are generated at build time.

Cons: Bigger migration effort — every notebook needs conversion and validation. Contributors used to Jupyter must learn the MyST format or use jupytext to sync between formats. Same CI execution cost as Option 2. Less mature tooling for interactive editing compared to .ipynb.

Option 4: Hybrid approach

Keep .ipynb for interactive development but use nbstripout to keep committed versions clean, and add an ARCHITECTURE.md (#843) to give agents and contributors a structural overview without needing to read the notebooks.

Pros: Pragmatic middle ground. Preserves the Jupyter workflow contributors know. Gets the git cleanliness benefits. Architecture doc provides the navigational aid.

Cons: Still has the GitHub web UI limitation (no rendered outputs). Doesn't go as far as MyST for LLM-friendliness.

CI execution cost (applies to Options 2, 2b, 3, and 4)

Stripping or externalising outputs means the docs build must execute notebooks. For CausalPy this involves PyMC MCMC sampling across 35 notebooks, which is non-trivial.

The cache problem: jupyter-cache can avoid re-executing unchanged notebooks by caching results keyed on cell content hashes. However, CI environments (GitHub Actions, ReadTheDocs) start from a clean state each run. On GitHub Actions you can persist the cache directory via actions/cache, but it's subject to eviction (10GB cap per repo, 7-day expiry on unused entries). On ReadTheDocs there's no built-in persistent cache between builds at all. Cache misses — from cell changes, expiry, or new branches — still pay the full execution cost. This adds infrastructure that needs maintaining and debugging.

Mitigation strategies (if stripping outputs):

  • Option 2b avoids the cache problem entirely by keeping outputs on main and only stripping on feature branches.
  • Gate on file changes: Only trigger full notebook execution when files in docs/ or causalpy/ change.
  • Lighter sampling for CI: Use fewer draws/chains in CI (via environment variable or config), with full-quality outputs only for release builds.
  • Execute only on merge to main: PRs validate syntax and imports; full execution happens on merge.
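
The lighter-sampling idea can be as simple as an environment switch. A sketch, where `CAUSALPY_CI_DOCS` is a hypothetical variable name (CausalPy does not currently define it):

```python
import os

def docs_sample_kwargs() -> dict:
    """Sampler settings for docs builds.

    CAUSALPY_CI_DOCS is a hypothetical environment variable; the returned
    dict would be passed as keyword arguments to the model's fit/sample call.
    """
    if os.environ.get("CAUSALPY_CI_DOCS") == "1":
        # Fast, lower-quality sampling for CI docs builds.
        return {"draws": 200, "tune": 200, "chains": 2}
    # Full-quality sampling for release/local builds.
    return {"draws": 2000, "tune": 1000, "chains": 4}
```

The trade-off is that CI-built docs would show noisier posteriors than the released docs, which is probably acceptable for PR previews.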

Related issues

Several existing issues touch on the same pain points around notebook format, docs tooling, and developer experience; see, for example, #672 (re-running notebooks via make run_notebooks_full) and #843 (adding an ARCHITECTURE.md).

Questions

  • Is the current format actually causing problems for contributors or users today, or is this primarily about future-proofing for AI tooling?
  • How important is it that notebooks render with outputs on GitHub's web UI vs. the docs site?
  • If we go with nbstripout or MyST, is the CI execution cost acceptable given the cache limitations?
  • Should this be coordinated across PyMC ecosystem projects (pymc-marketing, CausalPy, etc.) or decided per-repo?
