
Discussion: Notebook format and LLM-friendliness — should we change how docs are stored in git? #844

@daimon-pymclabs


This is a discussion topic, not a bug or feature request. (The GitHub token can't create Discussions, so using an issue.)

Motivation

A recent analysis of repo token counts across causal inference libraries showed CausalPy at ~10.7M tokens — far more than Google's CausalImpact (~52K) or Meta's GeoLift (~166K). But the comparison is misleading for two reasons:

  1. CausalPy covers 10+ quasi-experimental methods (DiD, staggered DiD, synthetic control, RD, RK, ITS, IV, IPW, ANCOVA) with both Bayesian and OLS variants. CausalImpact and GeoLift each do one thing. On a per-method basis, the content is comparable.

  2. ~90% of the raw token count is base64-encoded PNG images embedded in executed .ipynb files. The actual meaningful content (source code, notebook source cells, markdown docs) is ~848K tokens. The inflated number is an artifact of how notebooks store outputs, not a sign of bloat.
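
The ~90% figure is easy to sanity-check yourself. Here is a hedged sketch (not the script used for the original analysis) that measures what fraction of a notebook's bytes are base64 image payloads, using only the stdlib:

```python
import json

def image_byte_share(nb_text: str) -> float:
    """Fraction of a v4 notebook's bytes taken up by base64 image outputs."""
    nb = json.loads(nb_text)
    image_bytes = 0
    for cell in nb.get("cells", []):
        for out in cell.get("outputs", []):
            for mime, payload in out.get("data", {}).items():
                if mime.startswith("image/"):
                    # Payloads may be stored as a list of strings.
                    if isinstance(payload, list):
                        payload = "".join(payload)
                    image_bytes += len(payload)
    return image_bytes / max(len(nb_text), 1)
```

Run over the committed notebooks, this kind of measurement is what separates the ~10.7M raw figure from the ~848K of meaningful content.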

That said, this raises a real question: as AI tools (coding agents, RAG systems, embedding pipelines, repo indexers) become a bigger part of how people discover, learn, and contribute to libraries, should we rethink how documentation is stored in the repo?

The current state

  • 35 executed .ipynb notebooks in docs/source/notebooks/
  • Outputs (plots, tables, MCMC traces) are committed with the notebooks
  • The docs site (ReadTheDocs) renders from these committed notebooks
  • Git diffs on notebooks are effectively unreadable due to embedded base64 output data

Options

Option 1: Do nothing

Modern coding agents (Claude Code, Cursor, Copilot) don't dump the entire repo into context — they search and read files on demand. Claude Code has a native .ipynb reader that skips base64 data. For the primary use case of agents helping developers work on CausalPy, the current format works fine.

Pros: No migration effort. No CI changes. No risk of breaking the docs build.

Cons: RAG/embedding tools that index the full repo will still ingest base64 blobs. Git diffs on notebooks remain unreadable. Clone size stays large. Naive token counters will continue to produce misleading comparisons.

Option 2: nbstripout (strip outputs before commit)

Add nbstripout as a git filter via .gitattributes. Outputs are automatically stripped when staging notebooks, so only source cells are committed. The docs site still renders with outputs by executing notebooks during the build.
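
For intuition, what the filter does amounts to the following (a minimal sketch, not nbstripout's actual implementation, which also handles cell and notebook metadata):

```python
import json

def strip_outputs(nb_text: str) -> str:
    """Remove outputs and execution counts from a v4 notebook JSON string.

    Sketch of an nbstripout-style clean filter: only source cells survive.
    """
    nb = json.loads(nb_text)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []          # drop plots, tables, traces
            cell["execution_count"] = None
    return json.dumps(nb, indent=1)
```

The real tool is wired in with `nbstripout --install`, which registers the filter in git config and adds `*.ipynb filter=nbstripout` to `.gitattributes`.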

Pros: Minimal migration — one config change, one bulk strip. Repo drops from ~10.7M to ~1.1M naive tokens. Git diffs become reviewable. Clone size shrinks dramatically.

Cons: Notebooks on GitHub's web UI will render without outputs (no plots visible when browsing on github.com). CI must execute all notebooks on docs builds, which means running PyMC sampling for 35 notebooks — see CI execution cost section below.

Option 2b: nbstripout on feature branches only

A variant of Option 2: keep outputs committed on main (re-generated by a merge-triggered or scheduled CI job using make run_notebooks_full per #672), but use nbstripout on feature branches so PRs have clean diffs.

Pros: main always has outputs for GitHub browsing and docs. PR diffs are reviewable. No persistent cache infrastructure needed — the CI job on merge writes outputs directly back to the repo.

Cons: More workflow complexity (git filter behaviour differs by branch context). The merge-to-main execution job still has to run all 35 notebooks periodically. Merge conflicts need care, since PR branches carry stripped notebooks while main carries outputs.

Option 3: Convert to MyST Markdown via jupytext

Replace .ipynb files with .md files using MyST Markdown format, paired with jupytext for execution.
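
For illustration, a jupytext-paired MyST notebook is just a text file; the structure below is the standard format (the cell contents are hypothetical):

````markdown
---
jupytext:
  formats: md:myst
kernelspec:
  display_name: Python 3
  name: python3
---

# Difference in differences

Narrative markdown between cells diffs like any other text.

```{code-cell} ipython3
import causalpy as cp  # hypothetical example cell
```
````

Everything is plain text, so `git blame`, review comments, and grep all work on notebook content.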

Pros: Markdown is natively LLM-friendly — no JSON wrapper, no embedded binary. Files are cleanly diffable. Standard git workflows (blame, merge, review) work naturally. Outputs are generated at build time.

Cons: Bigger migration effort — every notebook needs conversion and validation. Contributors used to Jupyter must learn the MyST format or use jupytext to sync between formats. Same CI execution cost as Option 2. Less mature tooling for interactive editing compared to .ipynb.

Option 4: Hybrid approach

Keep .ipynb for interactive development but use nbstripout to keep committed versions clean, and add an ARCHITECTURE.md (#843) to give agents and contributors a structural overview without needing to read the notebooks.

Pros: Pragmatic middle ground. Preserves the Jupyter workflow contributors know. Gets the git cleanliness benefits. Architecture doc provides the navigational aid.

Cons: Still has the GitHub web UI limitation (no rendered outputs). Doesn't go as far as MyST for LLM-friendliness.

CI execution cost (applies to Options 2, 2b, 3, and 4)

Stripping or externalising outputs means the docs build must execute notebooks. For CausalPy this involves PyMC MCMC sampling across 35 notebooks, which is non-trivial.

The cache problem: jupyter-cache can avoid re-executing unchanged notebooks by caching results keyed on cell content hashes. However, CI environments (GitHub Actions, ReadTheDocs) start from a clean state each run. On GitHub Actions you can persist the cache directory via actions/cache, but it's subject to eviction (10GB cap per repo, 7-day expiry on unused entries). On ReadTheDocs there's no built-in persistent cache between builds at all. Cache misses — from cell changes, expiry, or new branches — still pay the full execution cost. This adds infrastructure that needs maintaining and debugging.

Mitigation strategies (if stripping outputs):

  • Option 2b avoids the cache problem entirely by keeping outputs on main and only stripping on feature branches.
  • Gate on file changes: Only trigger full notebook execution when files in docs/ or causalpy/ change.
  • Lighter sampling for CI: Use fewer draws/chains in CI (via environment variable or config), with full-quality outputs only for release builds.
  • Execute only on merge to main: PRs validate syntax and imports; full execution happens on merge.
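
The lighter-sampling idea can be as simple as an environment switch. A sketch, where `CAUSALPY_CI_DOCS` is a hypothetical variable name (CausalPy does not currently define it):

```python
import os

def docs_sample_kwargs() -> dict:
    """Sampler settings for docs builds.

    CAUSALPY_CI_DOCS is a hypothetical environment variable; the returned
    dict would be passed as keyword arguments to the model's fit/sample call.
    """
    if os.environ.get("CAUSALPY_CI_DOCS") == "1":
        # Fast, lower-quality sampling for CI docs builds.
        return {"draws": 200, "tune": 200, "chains": 2}
    # Full-quality sampling for release/local builds.
    return {"draws": 2000, "tune": 1000, "chains": 4}
```

The trade-off is that CI-built docs would show noisier posteriors than the released docs, which is probably acceptable for PR previews.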

Related issues

Several existing issues touch on the same pain points around notebook format, docs tooling, and developer experience; see, for example, #672 (re-running notebooks via make run_notebooks_full) and #843 (adding an ARCHITECTURE.md).

Questions

  • Is the current format actually causing problems for contributors or users today, or is this primarily about future-proofing for AI tooling?
  • How important is it that notebooks render with outputs on GitHub's web UI vs. the docs site?
  • If we go with nbstripout or MyST, is the CI execution cost acceptable given the cache limitations?
  • Should this be coordinated across PyMC ecosystem projects (pymc-marketing, CausalPy, etc.) or decided per-repo?
