Chat collapses time. nanoIM restores it.
A small from-scratch interaction-model lab for the temporal-aliasing bottleneck.
Controlled symbolic evidence for a representation bottleneck. Trains from scratch, no API keys, fully offline.
nanoIM is a compact research artifact for one question:
If two interactions flatten to the same final chat transcript but require different next actions, what can a transcript-only model know?
The answer is structural. A transcript-only model cannot separate the pair because its input is identical. A native micro-turn model can, if the missing variables are preserved: silence, timing, overlap, interruptions, visual cues, policy boundaries, and asynchronous tool events.
The repository is intentionally small enough to read like nanoGPT, but complete enough to generate data, train models from scratch, evaluate baselines, render proof galleries, and audit leakage.
| Layer | What is included | Why it matters |
|---|---|---|
| Formal spine | paper/temporal_aliasing.md, paper/the_interaction_bottleneck.md | States the impossibility argument and experimental method. |
| Data lab | symbolic 200 ms-style micro-turn traces across 10 task families | Tests interaction states that disappear in chat transcripts. |
| Fair baselines | transcript majority, transcript naive Bayes, transcript oracle upper bound, rule harness, field-lookup table | Separates structural transcript failure from weak modeling. |
| Trainable models | tiny GRU and tiny Transformer, stream/time embeddings, action head | Proves the effect with from-scratch local models. |
| Anti-cheat gates | alias equality tests, no transcript leakage, destructive controls | Makes shortcut explanations harder. |
| Reviewer proof | machine-readable scorecards, timeline gallery, honest limitations | Gives skeptics direct artifacts to inspect. |
Turn-based chat is a lossy serialization format. It turns interaction into alternating messages and erases the variables that often decide whether a system should wait, speak, yield, interrupt, request approval, resume after a tool result, or integrate a background event.
nanoIM studies that bottleneck as temporal aliasing:
flatten(trace_a) == flatten(trace_b)
target_action(trace_a) != target_action(trace_b)
Any policy that only receives flatten(trace) must assign the same action distribution to both traces. If the required actions differ, it must fail at least one member of the pair. That is the core claim, and the repo turns it into executable tests.
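The bound can be checked by brute force. The sketch below is illustrative only: the action list and function names are assumptions, not the repo's API. A deterministic transcript-only policy on one alias pair is fully described by the single action it emits for the shared transcript, so every such policy can be enumerated.

```python
# Hypothetical action set for illustration; the repo's action vocabulary may differ.
ACTIONS = ["WAIT", "SPEAK", "BACKCHANNEL", "REQUEST_APPROVAL"]

def best_transcript_only_accuracy(target_a, target_b, actions=ACTIONS):
    """Best accuracy on one alias pair over all deterministic transcript-only policies."""
    assert target_a != target_b, "an alias pair requires different target actions"
    # The policy sees the same transcript twice, so it emits one action for both members.
    return max(((a == target_a) + (a == target_b)) / 2 for a in actions)

def expected_accuracy(dist, target_a, target_b):
    """Expected accuracy of a stochastic policy that samples from one shared distribution."""
    return (dist.get(target_a, 0.0) + dist.get(target_b, 0.0)) / 2

print(best_transcript_only_accuracy("WAIT", "SPEAK"))  # 0.5
```

Randomizing does not help: the two target probabilities come from the same distribution and sum to at most 1, so `expected_accuracy` never exceeds 0.5 either.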
Held-out test split, hard suite:
| System | Temporal aliasing accuracy | Paired separation | Notes |
|---|---|---|---|
| Transcript majority | 0.300 | 0.000 | transcript only |
| Transcript naive Bayes | 0.467 | 0.000 | transcript only |
| Transcript oracle upper bound | 0.500 | 0.000 | structural ceiling for any transcript-only policy |
| Rule harness (stream-aware) | 1.000 | 1.000 | hand-written timing/stream rules |
| Field-lookup table (stream-aware) | 1.000 | 1.000 | memorized dict over stream fields |
| MicroTurn Tiny GRU | 1.000 +/- 0.0000 | 1.000 +/- 0.0000 | from scratch, seeds 3/7/11 |
| MicroTurn Tiny Transformer | 1.000 | 1.000 | from scratch, seed 7 |
Held-out test split, noisy counterbalanced suite:
| System | Temporal aliasing accuracy | Paired separation | Notes |
|---|---|---|---|
| Transcript majority | 0.400 | 0.000 | transcript only |
| Transcript naive Bayes | 0.4625 | 0.000 | transcript only |
| Transcript oracle upper bound | 0.500 | 0.000 | structural ceiling for any transcript-only policy |
| Rule harness (stream-aware) | 1.000 | 1.000 | hand-written timing/stream rules |
| Field-lookup table (stream-aware) | 1.000 | 1.000 | memorized dict over stream fields |
| MicroTurn Tiny GRU | 1.000 +/- 0.0000 | 1.000 +/- 0.0000 | from scratch, seeds 3/7/11 |
| No-timing ablation (GRU) | 0.650 +/- 0.0000 | 0.300 +/- 0.0000 | timing removed from the input |
Bootstrap 95% confidence intervals on per-example correctness from the hard-suite test split (10,000 resamples, seed 7):
| Method | TAA point estimate | TAA 95% CI |
|---|---|---|
| Transcript majority | 0.300 | [0.217, 0.383] |
| Tiny GRU (seed 7) | 1.000 | [1.000, 1.000] |
| Tiny Transformer (seed 7) | 1.000 | [1.000, 1.000] |
| Rule harness | 1.000 | [1.000, 1.000] |
Paired permutation tests (10,000 resamples, seed 7) for each stream-aware method against transcript majority all return p = 0.0000, consistent with p < 1e-4. Reproduce with uv run python -m nanoim.experiments.statistical_tests; full output in reports/statistical_tests.json.
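For readers who want to see what these two statistics involve, here is a minimal standard-library sketch. It is not the code in nanoim.experiments.statistical_tests: the function names are hypothetical, the permutation test shown is a sign-flip variant chosen for illustration, and the toy vectors assume 120 test examples (the n reported in the LLM table) with 36 correct to mirror the 0.300 point estimate.

```python
import random

def bootstrap_ci(correct, n_resamples=10_000, seed=7, alpha=0.05):
    """Percentile bootstrap CI for mean per-example correctness (list of 0/1)."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lo, hi

def paired_sign_flip_p(correct_a, correct_b, n_resamples=10_000, seed=7):
    """Two-sided paired permutation test: randomly swap the two methods per example."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        extreme += abs(flipped) >= observed
    return extreme / n_resamples

# Toy vectors (assumed, not loaded from the repo's reports).
majority = [1] * 36 + [0] * 84
stream_aware = [1] * 120

print(bootstrap_ci(majority))
print(bootstrap_ci(stream_aware))  # (1.0, 1.0, 1.0)
print(paired_sign_flip_p(stream_aware, majority))
```

With 10,000 resamples the smallest nonzero p-value the test can report is 1e-4, which is why a printed 0.0000 is read as p < 1e-4 rather than as exactly zero.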
Read this as a statement about the representation, not the model. Transcript-only policies are provably capped at 0.500 on balanced alias pairs: the oracle can assign only one action per identical transcript, so it must miss one member of every pair (paired separation 0.000). Every stream-aware method I tried (rules, lookup table, GRU, Transformer) saturates at 1.000. nanoIM does not claim the neural model beats symbolic baselines; in this controlled lab it ties them, which is the point. The trained model is an existence proof that the signal is learnable from scratch. The contribution is the 0.500 → 1.000 gap from restoring the streams (state and timing) the transcript discards.
What "held-out test split" does and does not mean. Train/val/test use disjoint transcript wordings, so a transcript-only model cannot win by memorizing text templates. But the streams that decide the action are identical across splits; only the filler wording differs. The test numbers therefore measure within-distribution reproduction plus robustness to decoy noise, not generalization to novel interaction structure or unseen task families. nanoIM makes no generalization claim for the learned model beyond this controlled setup.
The field-lookup table is deliberately included as a strong baseline, not a strawman (reports/hard_field_lookup_scorecard.json, reports/noisy_field_lookup_scorecard.json; reproduce with uv run python -m nanoim.eval --baseline field_lookup ...). That a plain dict saturates the task once it sees the streams is the point: the bottleneck is the transcript representation, not model capacity.
The structural argument predicts that any deterministic transcript-only policy is capped at TAA 0.500 on balanced alias pairs with paired separation 0.000 under this construction, regardless of capacity. I tested that against seven modern open-weights LLMs spanning five families and 3B–30B parameters, each given the same flattened transcript and asked for the next action. The parser uses max_tokens=2048 and falls back to the reasoning field for models that emit chain-of-thought (Qwen3 and GLM-4.7 do); every prediction is sourced from a real model output, not from a default. Zero parse failures across 840 LLM calls.
| Model | Family | Params | Type | TAA | Paired separation | n |
|---|---|---|---|---|---|---|
| Phi-4 14B | Microsoft | 14B | dense | 0.133 | 0.000 | 120 |
| Llama 3.2 3B | Meta | 3B | dense | 0.208 | 0.000 | 120 |
| Qwen3 4B | Alibaba | 4B | dense | 0.133 | 0.000 | 120 |
| Qwen3 14B MLX | Alibaba | 14B | dense | 0.158 | 0.000 | 120 |
| Qwen3 30B-A3B MLX | Alibaba | 30B (3B active) | MoE | 0.183 | 0.000 | 120 |
| Gemma 4 26B | Google | 26B | dense | 0.208 | 0.000 | 120 |
| GLM-4.7 Flash | Zhipu AI | 30B class | reasoning (2026) | 0.117 | 0.000 | 120 |
420 alias pairs evaluated, 420/420 with paired separation = 0.000. Every tested model, across family, parameter count, dense/MoE/reasoning architecture, and chain-of-thought behavior, assigns the same action to both members of every alias pair. The 19K-parameter from-scratch GRU that sees the streams reaches 1.000 / 1.000 on the same data. In this benchmark, the bottleneck is the representation, not the model. Per-model JSON in reports/llm_baseline_*.json; aggregate in reports/llm_baseline_summary.json. Reproduce with uv run python -m nanoim.experiments.llm_baseline --base-url http://127.0.0.1:11434/v1 --model qwen3:4b --out reports/llm_baseline_qwen3_4b.json (Ollama) or substitute the LM Studio endpoint and model name.
Same trained checkpoints, no retraining; each input field is blanked at inference time. The drop in temporal aliasing accuracy measures how much each architecture actually consults each field at decision time. Both architectures agree on the ranking:
| Field blanked | GRU TAA drop | Transformer TAA drop |
|---|---|---|
| user_audio_state | -0.350 | -0.400 |
| visual_event | -0.183 | -0.200 |
| background_event | -0.100 | -0.100 |
| t_ms / dt_ms | -0.050 | -0.050 |
| model_audio_state | -0.050 | -0.050 |
| policy_event | -0.050 | -0.050 |
user_audio_state is the single most load-bearing field. That fits the data: most task families turn on whether the user is speaking, silent, hesitating, or interrupting at the critical timestep. This is different from the retraining no_X ablations (where the model can adapt around a missing field): the question here is what the trained model relies on at inference. Reproduce with uv run python -m nanoim.experiments.occlusion --out reports/occlusion_analysis.json. Full per-field, per-architecture detail in reports/occlusion_analysis.json.
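The occlusion procedure itself is simple: keep the policy fixed, overwrite one input field, and re-measure accuracy. A minimal sketch follows, with a hand-written toy policy standing in for the trained checkpoint. The blank token, the example format, and the function names are assumptions made for illustration, not the interface of nanoim.experiments.occlusion.

```python
def occlusion_drop(policy, examples, field, blank="<blank>"):
    """Accuracy change when `field` is blanked at inference; the policy is not refit."""
    def accuracy(rows):
        return sum(policy(r["streams"]) == r["target"] for r in rows) / len(rows)

    occluded = [{**r, "streams": {**r["streams"], field: blank}} for r in examples]
    return accuracy(occluded) - accuracy(examples)

def toy_policy(streams):
    # Stand-in for a trained model: reads two stream fields and nothing else.
    if streams["policy_event"] == "approval_required":
        return "REQUEST_APPROVAL"
    if streams["user_audio_state"] == "silent":
        return "SPEAK"
    return "WAIT"

examples = [
    {"streams": {"user_audio_state": "silent", "policy_event": "none"}, "target": "SPEAK"},
    {"streams": {"user_audio_state": "silent", "policy_event": "none"}, "target": "SPEAK"},
    {"streams": {"user_audio_state": "hesitating", "policy_event": "none"}, "target": "WAIT"},
    {"streams": {"user_audio_state": "speaking", "policy_event": "none"}, "target": "WAIT"},
    {"streams": {"user_audio_state": "silent", "policy_event": "approval_required"},
     "target": "REQUEST_APPROVAL"},
]

for field in ("user_audio_state", "policy_event"):
    print(field, round(occlusion_drop(toy_policy, examples, field), 3))
# user_audio_state -0.4
# policy_event -0.2
```

Because nothing is retrained, a drop here reflects reliance at decision time, which is the distinction the paragraph above draws against the retraining ablations.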
I separately tested whether each method generalizes to interaction structures it did not see during training. For each of the ten task families, I retrained a fresh tiny GRU on the other nine families and evaluated on the held-out family. The result is honest about which methods generalize and which do not:
Hard suite, mean across 10 held-out families × 3 seeds:
| Method | Mean LOFO TAA | Mean LOFO paired separation |
|---|---|---|
| Rule harness (invariant) | 1.000 | 1.000 |
| Tiny GRU | 0.344 | 0.156 |
| Tiny Transformer | 0.256 | 0.011 |
| Field-lookup table | 0.300 | 0.000 |
| Transcript majority | 0.300 | n/a |
Noisy suite, mean across 10 held-out families × 3 seeds:
| Method | Mean LOFO TAA | Mean LOFO paired separation |
|---|---|---|
| Rule harness (invariant) | 1.000 | 1.000 |
| Tiny GRU | 0.429 | 0.075 |
| Tiny Transformer | 0.317 | 0.033 |
| Field-lookup table | 0.400 | 0.000 |
| Transcript majority | 0.400 | n/a |
The rule harness is invariant by construction (its fit is a no-op), so it transfers trivially on every held-out family and the release verifier asserts this. The field-lookup table collapses to its global default action on stream keys it never saw, which leaves it at the transcript-majority TAA (0.300 hard, 0.400 noisy) with paired separation 0.000. Neither trained architecture generalizes: the Tiny GRU lands only slightly above the transcript-majority baseline (0.344 vs 0.300 hard, 0.429 vs 0.400 noisy), and the Tiny Transformer falls below it on both suites (0.256 and 0.317). That the more expressive Transformer scores lower than the GRU is consistent with extra capacity going to family-specific overfitting rather than generalization. Source: reports/lofo_sweep.json and reports/lofo_sweep_noisy.json.
This is the kind of negative result that should ship with the positive one. The repo's claim is about the representation and the invariant rule harness, not about novel-structure generalization from the learned model. Reproduce with uv run python -m nanoim.experiments.lofo --out reports/lofo_sweep.json. Per-family numbers are in reports/lofo_sweep.json.
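The leave-one-family-out protocol, and the reason a lookup table cannot separate pairs in an unseen family, can be sketched in a few lines. The example schema (`family`, `stream_key`, `target`) and the helper names are hypothetical and are not taken from nanoim.experiments.lofo.

```python
from collections import Counter

def lofo_splits(examples):
    """Yield (held_out_family, train, test) with one family held out at a time."""
    for held_out in sorted({ex["family"] for ex in examples}):
        train = [ex for ex in examples if ex["family"] != held_out]
        test = [ex for ex in examples if ex["family"] == held_out]
        yield held_out, train, test

def fit_lookup(train):
    """Memorize stream_key -> majority action; unseen keys get the global majority."""
    votes = {}
    for ex in train:
        votes.setdefault(ex["stream_key"], Counter())[ex["target"]] += 1
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    default = Counter(ex["target"] for ex in train).most_common(1)[0][0]
    return lambda key: table.get(key, default)

# Toy data: three families, each with its own stream keys (assumed for illustration).
examples = [
    {"family": "yield_detection", "stream_key": "user_silent", "target": "SPEAK"},
    {"family": "yield_detection", "stream_key": "user_speaking", "target": "WAIT"},
    {"family": "approval_interruption", "stream_key": "approval_cue", "target": "REQUEST_APPROVAL"},
    {"family": "approval_interruption", "stream_key": "no_approval_cue", "target": "CALL_TOOL"},
    {"family": "barge_in_recovery", "stream_key": "overlap", "target": "STOP_SPEAKING"},
    {"family": "barge_in_recovery", "stream_key": "no_overlap", "target": "WAIT"},
]

distinct_predictions = {}
for held_out, train, test in lofo_splits(examples):
    predict = fit_lookup(train)
    distinct_predictions[held_out] = {predict(ex["stream_key"]) for ex in test}

print({family: len(preds) for family, preds in distinct_predictions.items()})
```

In this toy setup every held-out family receives exactly one distinct prediction, so the table can be right on at most one member of each pair, matching the 0.000 paired separation reported for the field-lookup row.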
flowchart LR
A["Symbolic micro-turn generator"] --> B["Paired alias examples"]
B --> C["Anti-leakage tests"]
C --> D["Transcript baselines (oracle bound 0.50)"]
C --> E["Rule harness / field-lookup"]
C --> F["MicroTurn Tiny GRU / Transformer"]
F --> G["Ablations and destructive controls"]
D --> H["Machine-readable scorecards"]
E --> H
G --> H
H --> I["Timeline gallery"]
Requires Python 3.11 or 3.12. Install uv first, then:
uv sync --dev
uv run python -m nanoim --help
uv run python -m nanoim.data.generate --suite mini --out data/mini.jsonl
uv run python -m nanoim.train --config configs/tiny.yaml
uv run python -m nanoim.eval --checkpoint runs/tiny/best.pt --suite mini --out reports/scorecard.json
uv run python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline.html
uv run pytest
uv build
uv run python -m nanoim.package_audit --out reports/package_audit.json

After activating .venv, the spec's bare commands work as written:
source .venv/bin/activate
python -m nanoim --help
python -m nanoim.data.generate --suite mini --out data/mini.jsonl
python -m nanoim.train --config configs/tiny.yaml
python -m nanoim.eval --checkpoint runs/tiny/best.pt --suite mini --out reports/scorecard.json
python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline.html
pytest

If you have ten minutes and want to attack the claim, start here:
- docs/reviewer_guide.md - how to attack the claim using the committed artifacts.
- CLAIM_LEDGER.md - public claims tied to exact evidence, reviewer attacks, and non-claims.
- SOURCE_MAP.md - where every artifact, command, model, report, and test lives.
- reports/failure_report.md - preserved failure analysis and interpretation.
- reports/timeline_gallery.html - local visual proof gallery for same-transcript/different-timing pairs.
The commands below refresh committed artifact paths. Use --out /tmp/... for no-diff smoke runs.
If you are reviewing the claim skeptically, read these in order:
- paper/temporal_aliasing.md for the toy impossibility argument.
- tests/test_aliasing.py for the anti-leakage contract.
- reports/scorecard.md for the human-readable result summary.
- reports/noisy_sweep.json and reports/adversarial_controls.json for timing and shortcut checks.
- reports/timeline_gallery.html for side-by-side same-transcript/different-timing examples.
- reports/failure_report.md for known failures and interpretation.
- HOSTILE_REVIEW.md and AUTHOR_RESPONSE.md for the strongest objections and current answers.
nanoIM ships at least 10 interaction families, each designed so a transcript can be identical while the correct micro-turn action differs:
| Family | Missing variable | Example action contrast |
|---|---|---|
| yield detection | silence duration and user audio state | WAIT vs SPEAK |
| hesitation | filled pauses and non-final prosody proxy | WAIT vs BACKCHANNEL |
| barge-in recovery | overlap and interruption onset | STOP_SPEAKING vs WAIT |
| self-correction | revision timing | WAIT vs SPEAK |
| backchannel timing | short acknowledgement placement | BACKCHANNEL vs ASK_CLARIFICATION |
| clarification timing | visual/text confidence cue | WAIT vs ASK_CLARIFICATION |
| visual cue trigger | non-verbal event stream | ASK_CLARIFICATION vs SPEAK |
| approval interruption | policy boundary during speech | REQUEST_APPROVAL vs CALL_TOOL |
| async tool weaving | background tool result timing | CALL_TOOL vs WAIT |
| background result integration | external result arrives mid-turn | RESUME_WITH_RESULT vs SPEAK |
Both examples flatten to the same transcript:
User: book the 3 pm slot
But they require different actions:
| Trace | Preserved micro-turn state | Correct action |
|---|---|---|
| A | user is still hesitating, no approval cue | WAIT |
| B | user has yielded, approval cue is active | REQUEST_APPROVAL |
The transcript-only model sees identical input. The micro-turn model receives the streams that decide the action.
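To make the pair concrete, here is one way the two traces might look as 200 ms micro-turn rows. The row layout, the `text_delta` field, and the hand-written rule are assumptions for illustration; the field names user_audio_state and policy_event follow the occlusion table, but the values are invented.

```python
def row(t_ms, audio, text="", policy="none"):
    """One hypothetical micro-turn row."""
    return {"t_ms": t_ms, "user_audio_state": audio, "text_delta": text, "policy_event": policy}

def flatten(trace):
    """Join non-empty text deltas into a chat line; every other stream is dropped."""
    return "User: " + " ".join(r["text_delta"] for r in trace if r["text_delta"])

trace_a = [
    row(0, "speaking", "book the"),
    row(200, "speaking", "3 pm slot"),
    row(400, "hesitating"),                      # still holding the turn
]
trace_b = [
    row(0, "speaking", "book the"),
    row(200, "speaking", "3 pm slot"),
    row(400, "yielded", policy="approval_cue"),  # turn released, approval boundary active
]

def micro_turn_rule(trace):
    """Toy stream-aware rule that reads only the final row."""
    last = trace[-1]
    if last["user_audio_state"] == "yielded" and last["policy_event"] == "approval_cue":
        return "REQUEST_APPROVAL"
    return "WAIT"

print(flatten(trace_a))                                    # User: book the 3 pm slot
print(flatten(trace_a) == flatten(trace_b))                # True
print(micro_turn_rule(trace_a), micro_turn_rule(trace_b))  # WAIT REQUEST_APPROVAL
```

The only rows that differ between A and B carry no text, so flattening removes exactly the information the decision depends on.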
flowchart TB
subgraph Data["data layer"]
G["nanoim.data.generate"]
P["nanoim.data.pilot"]
HP["nanoim.data.human_protocol"]
end
subgraph Features["representation"]
TF["transcript-only features"]
MF["micro-turn stream/time features"]
end
subgraph Policies["policies"]
TM["transcript majority / NB / oracle"]
RH["rule harness"]
GRU["tiny GRU"]
TX["tiny Transformer"]
end
subgraph Evidence["evidence"]
EV["nanoim.eval"]
SW["nanoim.experiments.sweep"]
CT["nanoim.controls"]
VZ["nanoim.viz.render"]
VR["nanoim.verify_release"]
end
G --> TF
G --> MF
P --> MF
HP --> MF
TF --> TM
MF --> RH
MF --> GRU
MF --> TX
TM --> EV
RH --> EV
GRU --> EV
TX --> EV
EV --> SW
EV --> CT
EV --> VZ
SW --> VR
CT --> VR
The default model is deliberately small:
- stream embeddings for symbolic audio/text/visual/background/policy state;
- time/delta-time embeddings for 200 ms-style micro-turns;
- a compact GRU backbone with per-timestep action supervision;
- an action head for both critical-decision and trajectory metrics;
- optional tiny Transformer comparator with the same feature boundary.
Everything trains locally from scratch. No paid API keys, frontier APIs, hosted models, or hidden services are used.
Scorecards are machine-readable JSON. The main release metrics include:
| Metric | Meaning |
|---|---|
| transcript_upper_bound | best possible transcript-only ceiling under paired aliases |
| temporal_aliasing_accuracy | critical-action accuracy on alias pairs |
| paired_separation_rate | fraction of pairs where the policy separates both members correctly |
| action_accuracy | ordinary critical-action accuracy |
| sequence_action_accuracy | per-timestep trajectory accuracy |
| wait_vs_speak_accuracy | separation of wait/speak timing decisions |
| false_interrupt_rate / missed_interrupt_rate | interruption handling errors |
| barge_in_recovery_time | recovery latency after barge-in |
| clarification_timing_score | timing-sensitive clarification behavior |
| tool_weaving_score | background/tool-event integration |
| turn_based_baseline_delta | micro-turn gain over transcript-only baseline |
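The two headline metrics follow directly from the definitions in the table. The sketch below computes them from per-example predictions; the row schema (`pair_id`, `target`, `pred`) and the function name are hypothetical, and the repo's evaluator may organize this differently.

```python
from collections import defaultdict

def alias_metrics(rows):
    """Return (temporal_aliasing_accuracy, paired_separation_rate) for alias-pair rows."""
    pairs = defaultdict(list)
    for r in rows:
        pairs[r["pair_id"]].append(r["pred"] == r["target"])
    n_examples = sum(len(hits) for hits in pairs.values())
    taa = sum(sum(hits) for hits in pairs.values()) / n_examples
    # A pair counts as separated only when every member is predicted correctly.
    separation = sum(all(hits) for hits in pairs.values()) / len(pairs)
    return taa, separation

# A transcript-only oracle: one action per pair, so it gets exactly one member right.
oracle_rows = [
    {"pair_id": 0, "target": "WAIT", "pred": "WAIT"},
    {"pair_id": 0, "target": "SPEAK", "pred": "WAIT"},
    {"pair_id": 1, "target": "CALL_TOOL", "pred": "CALL_TOOL"},
    {"pair_id": 1, "target": "REQUEST_APPROVAL", "pred": "CALL_TOOL"},
]
# A stream-aware policy that gets both members of every pair right.
stream_rows = [{**r, "pred": r["target"]} for r in oracle_rows]

print(alias_metrics(oracle_rows))  # (0.5, 0.0)
print(alias_metrics(stream_rows))  # (1.0, 1.0)
```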
nanoIM is only interesting if the transcript baseline is fair. The repo therefore checks:
- alias-pair members share identical flattened transcripts;
- transcript features exclude pair IDs, task families, labels, event labels, timestamps, split IDs, and timing annotations;
- generated hard/noisy artifacts match deterministic generator output;
- destructive controls collapse performance when timing/stream/label structure is broken;
- public docs distinguish symbolic proof from real audio/video competence.
The relevant tests and checks are tests/test_aliasing.py, nanoim.verify_release (asserts the scientific contract — transcript oracle bound 0.50, paired separation 0.00, and large destructive-control drops), and nanoim.security_audit.
Hard-suite multi-seed input ablations:
| Variant | Temporal aliasing accuracy | Paired separation | Main observed hit |
|---|---|---|---|
| Full | 1.000 +/- 0.0000 | 1.000 +/- 0.0000 | baseline |
| Small model | 0.983 +/- 0.0236 | 0.967 +/- 0.0471 | capacity halved; hard suite remains learnable |
| No audio | 0.847 +/- 0.0039 | 0.694 +/- 0.0079 | turn ownership / hesitation |
| No visual | 0.942 +/- 0.0068 | 0.883 +/- 0.0136 | clarification timing |
| No background | 0.894 +/- 0.0039 | 0.789 +/- 0.0079 | tool weaving |
| No policy | 0.939 +/- 0.0079 | 0.878 +/- 0.0157 | approval/tool boundary |
| No timing | 0.950 +/- 0.0000 | 0.900 +/- 0.0000 | hard suite mostly uses state streams |
Noisy-suite multi-seed ablations:
| Variant | Temporal aliasing accuracy | Paired separation | Main observed hit |
|---|---|---|---|
| Full | 1.000 +/- 0.0000 | 1.000 +/- 0.0000 | baseline |
| No timing | 0.650 +/- 0.0000 | 0.300 +/- 0.0000 | timing-only alias pairs |
| No background | 0.935 +/- 0.0078 | 0.871 +/- 0.0156 | tool/result state |
| No policy | 0.950 +/- 0.0000 | 0.900 +/- 0.0000 | approval boundaries |
| No visual | 0.944 +/- 0.0051 | 0.887 +/- 0.0102 | visual cue trigger |
Destructive controls on noisy seed 7:
| Control | TAA | Drop from base |
|---|---|---|
| Base | 1.0000 | n/a |
| Scramble t_ms within each trace | 0.4000 | 0.6000 |
| Mismatch critical non-transcript streams across examples | 0.3188 | 0.6813 |
| Permute held-out target labels | 0.1967 | 0.8033 |
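As an illustration of the first control, the sketch below shuffles t_ms values among the rows of a single trace while leaving every other field in place. The row schema and function name are assumptions; nanoim.controls may implement the scramble differently, for example by also recomputing dt_ms.

```python
import random

def scramble_t_ms(trace, rng):
    """Reassign the trace's own t_ms values to its rows in random order."""
    times = [r["t_ms"] for r in trace]
    rng.shuffle(times)
    # Copy each row so the original trace is left untouched.
    return [{**r, "t_ms": t} for r, t in zip(trace, times)]

trace = [
    {"t_ms": 0, "user_audio_state": "speaking"},
    {"t_ms": 200, "user_audio_state": "speaking"},
    {"t_ms": 400, "user_audio_state": "silent"},
    {"t_ms": 600, "user_audio_state": "silent"},
    {"t_ms": 800, "user_audio_state": "silent"},
]

scrambled = scramble_t_ms(trace, random.Random(7))
print([r["t_ms"] for r in scrambled])
```

The set of timestamps and the order of the state streams are both preserved; only the association between them is broken, so a model that genuinely uses timing should lose accuracy while one that ignores it should not.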
The visualizer writes a self-contained HTML gallery:
uv run python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline_gallery.html

It shows paired examples side by side with:
- the identical flattened transcript;
- micro-turn rows for user audio, text deltas, model state, visual events, background events, and policy events;
- target action vs micro-turn prediction;
- transcript-baseline prediction;
- highlighted row-level errors;
- timing annotations such as 0.2s-0.4s (200ms).
Open reports/timeline_gallery.html after running the command.
These are correctness checks, not a self-administered quality score. Each recomputes from the data and committed artifacts:
| Check | Command | What it asserts |
|---|---|---|
| Scientific contract | nanoim.verify_release | transcript oracle bound stays 0.50, paired separation 0.00, multi-seed/Transformer clear the oracle delta, and the destructive controls cause large drops |
| Calibration | nanoim.calibration | noisy checkpoint accuracy 1.000, ECE 0.0251, Brier 0.0074, NLL 0.0275 |
| Security hygiene | nanoim.security_audit | no secrets, no leaked paths, no oversized tracked files |
| Package artifacts | nanoim.package_audit | wheel/sdist metadata, console script, license, required modules |
| Hugging Face bundle | nanoim.hf_export + nanoim.hf_validate | deterministic bundle, card/split/checksum/replay validation with explicit noisy-checkpoint replay thresholds |
| Reproducibility | nanoim.repro | sha256 of every canonical source and evidence artifact |
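For context on the calibration row, the three quantities have conventional definitions that can be written down compactly. The sketch below uses equal-width confidence bins for ECE and the multi-class sum-of-squares form of the Brier score; whether nanoim.calibration uses the same binning or Brier convention is an assumption, and the function name is hypothetical.

```python
import math

def calibration_summary(probs, labels, n_bins=10):
    """ECE (equal-width bins), multi-class Brier score, and mean negative log-likelihood."""
    n = len(labels)
    conf = [max(p) for p in probs]
    hit = [int(p.index(max(p)) == y) for p, y in zip(probs, labels)]

    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(conf) if lo < c <= hi]
        if idx:
            acc = sum(hit[i] for i in idx) / len(idx)
            avg_conf = sum(conf[i] for i in idx) / len(idx)
            ece += len(idx) / n * abs(acc - avg_conf)

    # Squared distance between the predicted distribution and the one-hot label.
    brier = sum(
        sum((pk - (k == y)) ** 2 for k, pk in enumerate(p)) for p, y in zip(probs, labels)
    ) / n
    nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n
    return {"ece": ece, "brier": brier, "nll": nll}

# Two correct but under-confident toy predictions.
summary = calibration_summary([[0.9, 0.1], [0.8, 0.2]], [0, 0])
print({name: round(value, 4) for name, value in summary.items()})
# {'ece': 0.15, 'brier': 0.05, 'nll': 0.1643}
```

A model can be 100% accurate and still have nonzero ECE, as in this toy case, which is why the calibration row reports accuracy alongside the other three numbers.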
What is intentionally not here: there is no consented natural human-event corpus, and no independent external review. Those are real-world steps the repo does not pretend to have completed. Hosted CI runs on every released tag and should be confirmed green before public citation.
The release includes a future-facing protocol for collecting consented human event logs:
uv run python -m nanoim.data.human_protocol \
--events data/human_events.jsonl \
--examples data/human_examples.jsonl \
--out reports/human_event_protocol_audit.json

This protocol validates schema, monotonic timing, consent metadata, redaction status, obvious identifier leakage, and, when candidate examples exist, the same-transcript/different-action alias-pair contract. Passing without a corpus is not real-world validation; it only proves the repo is ready to collect and audit a real corpus without changing the claim boundary.
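Two of those checks, required metadata and monotonic timing, are easy to picture. The following is a minimal sketch under assumed field names (`t_ms`, `consent`, `redacted`); the actual schema enforced by nanoim.data.human_protocol is not shown in this README and may differ.

```python
REQUIRED_FIELDS = ("t_ms", "consent", "redacted")  # hypothetical schema

def validate_events(events):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    errors = []
    last_t = None
    for i, event in enumerate(events):
        for field in REQUIRED_FIELDS:
            if field not in event:
                errors.append(f"row {i}: missing {field}")
        if event.get("consent") is False:
            errors.append(f"row {i}: consent not granted")
        t = event.get("t_ms")
        if t is not None:
            if last_t is not None and t < last_t:
                errors.append(f"row {i}: t_ms {t} is earlier than previous {last_t}")
            last_t = t
    return errors

good = [
    {"t_ms": 0, "consent": True, "redacted": True},
    {"t_ms": 200, "consent": True, "redacted": True},
]
bad = [
    {"t_ms": 200, "consent": True, "redacted": True},
    {"t_ms": 0, "consent": True},
]

print(validate_events(good))  # []
print(validate_events(bad))
# ['row 1: missing redacted', 'row 1: t_ms 0 is earlier than previous 200']
```

Returning every problem instead of stopping at the first one suits an audit report, where a reviewer wants the full list in a single run.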
nanoIM includes local Hub-ready package templates for a dataset repo, a model repo, and a static Space:
uv run python -m nanoim.hf_export --out dist/huggingface --manifest-out reports/huggingface_release_manifest.json
uv run python -m nanoim.hf_validate --bundle dist/huggingface --manifest reports/huggingface_release_manifest.json --out reports/huggingface_offline_validation.json
uv run python -m nanoim.release_bundle --out dist/release/nanoim-0.1.5 --manifest-out reports/release_bundle_manifest.json

dist/huggingface/dataset contains the JSONL data and dataset card. dist/huggingface/model contains checkpoints, configs, scorecards, calibration evidence, and a model card. dist/huggingface/space contains a static proof-gallery front door. The offline validator checks cards, repo IDs, split files, checksums, manifests, alias contracts, model-card coverage for every exported checkpoint, and exported noisy-checkpoint replay before upload. The replay gate is near-perfect rather than exact (TAA >= 0.98, paired separation >= 0.97) and is applied to the committed release checkpoint. Full CI separately retrains models for reproduction coverage, then restores the committed release-packet inputs before building the publication bundle. The export is local-only and does not upload credentials or call the Hugging Face API.
See docs/huggingface_release.md and docs/release_engineering.md for the pre-upload checklist, .hfignore hygiene, SBOM/checksum bundle, and provenance workflow.
The full reproduction path is intentionally explicit:
uv run python -m nanoim.data.generate --suite mini --out data/mini.jsonl
uv lock --check
uv build
uv run python -m nanoim.package_audit --out reports/package_audit.json
uv run python -m nanoim.data.generate --suite hard --out data/hard.jsonl
uv run python -m nanoim.data.generate --suite noisy --seed 7 --out data/noisy.jsonl
uv run python -m nanoim.data.pilot --from-events data/pilot_events.jsonl --events data/pilot_events.jsonl --out data/pilot.jsonl
uv run python -m nanoim.train --config configs/tiny.yaml
uv run python -m nanoim.eval --checkpoint runs/tiny/best.pt --suite mini --out reports/scorecard.json
uv run python -m nanoim.eval --baseline transcript --suite mini --out reports/transcript_scorecard.json
uv run python -m nanoim.eval --baseline rule --suite mini --out reports/rule_scorecard.json
uv run python -m nanoim.train --config configs/hard.yaml
uv run python -m nanoim.eval --checkpoint runs/hard/full/seed_7/best.pt --suite hard --data data/hard.jsonl --out reports/hard_scorecard.json
uv run python -m nanoim.eval --baseline transcript --suite hard --data data/hard.jsonl --out reports/hard_transcript_scorecard.json
uv run python -m nanoim.eval --baseline transcript_nb --suite hard --data data/hard.jsonl --out reports/hard_transcript_nb_scorecard.json
uv run python -m nanoim.eval --baseline rule --suite hard --data data/hard.jsonl --out reports/hard_rule_scorecard.json
uv run python -m nanoim.experiments.sweep --suite hard --seeds 3,7,11 --ablations 'full;small_model;no_audio;no_visual;no_background;no_policy;no_timing' --epochs 45 --data-seed 7 --out reports/hard_sweep.json
uv run python -m nanoim.train --config configs/transformer_hard.yaml
uv run python -m nanoim.eval --checkpoint runs/transformer/hard/seed_7/best.pt --suite hard --data data/hard.jsonl --out reports/transformer_hard_scorecard.json
uv run python -m nanoim.controls --data data/hard.jsonl --checkpoint runs/transformer/hard/seed_7/best.pt --suite hard --out reports/transformer_hard_controls.json
uv run python -m nanoim.train --config configs/noisy.yaml
uv run python -m nanoim.eval --checkpoint runs/noisy/full/seed_7/best.pt --suite noisy --data data/noisy.jsonl --out reports/noisy_scorecard.json
uv run python -m nanoim.calibration --checkpoint runs/noisy/full/seed_7/best.pt --suite noisy --data data/noisy.jsonl --out reports/calibration_audit.json
uv run python -m nanoim.eval --baseline transcript --suite noisy --data data/noisy.jsonl --out reports/noisy_transcript_scorecard.json
uv run python -m nanoim.eval --baseline transcript_nb --suite noisy --data data/noisy.jsonl --out reports/noisy_transcript_nb_scorecard.json
uv run python -m nanoim.eval --baseline rule --suite noisy --data data/noisy.jsonl --out reports/noisy_rule_scorecard.json
uv run python -m nanoim.eval --baseline field_lookup --suite hard --data data/hard.jsonl --out reports/hard_field_lookup_scorecard.json
uv run python -m nanoim.eval --baseline field_lookup --suite noisy --data data/noisy.jsonl --out reports/noisy_field_lookup_scorecard.json
uv run python -m nanoim.eval --checkpoint runs/hard/full/seed_7/best.pt --suite pilot --data data/pilot.jsonl --eval-split all --out reports/pilot_scorecard.json
uv run python -m nanoim.eval --baseline rule --suite pilot --data data/pilot.jsonl --eval-split all --out reports/pilot_rule_scorecard.json
uv run python -m nanoim.experiments.sweep --suite noisy --seeds 3,7,11 --ablations 'full;no_timing;no_audio;no_visual;no_background;no_policy' --epochs 120 --data-seed 7 --out reports/noisy_sweep.json
uv run python -m nanoim.controls --data data/noisy.jsonl --checkpoint runs/noisy/full/seed_7/best.pt --suite noisy --out reports/adversarial_controls.json
uv run python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline.html
uv run python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline_gallery.html
uv run python -m nanoim.verify_release --out reports/release_verification.json
uv run python -m nanoim.data.human_protocol --events data/human_events.jsonl --examples data/human_examples.jsonl --out reports/human_event_protocol_audit.json
uv run python -m nanoim.hf_export --out dist/huggingface --manifest-out reports/huggingface_release_manifest.json
uv run python -m nanoim.hf_validate --bundle dist/huggingface --manifest reports/huggingface_release_manifest.json --out reports/huggingface_offline_validation.json
uv run python -m nanoim.security_audit --out reports/security_audit.json
uv run python -m nanoim.repro --out reports/reproducibility.json
uv run python -m nanoim.release_bundle --out dist/release/nanoim-0.1.5 --manifest-out reports/release_bundle_manifest.json

| Path | Role |
|---|---|
| nanoim/data/generate.py | deterministic symbolic suite generator |
| nanoim/data/pilot.py | raw event-log to micro-turn importer |
| nanoim/data/human_protocol.py | future consented human-event protocol and alias-contract validator |
| nanoim/features.py | transcript and micro-turn feature boundaries |
| nanoim/model.py | tiny GRU and Transformer models |
| nanoim/train.py | from-scratch training loop |
| nanoim/eval.py | scorecard evaluator |
| nanoim/calibration.py | checkpoint confidence-calibration audit |
| nanoim/controls.py | destructive controls |
| nanoim/viz/render.py | timeline HTML renderer |
| nanoim/baselines.py | transcript majority/NB, rule harness, field-lookup table |
| nanoim/verify_release.py | asserts the scientific result contract |
| nanoim/security_audit.py | local secret/path/size hygiene scan |
| nanoim/hf_export.py | Hugging Face dataset/model/Space export |
| nanoim/hf_validate.py | offline Hugging Face bundle validator |
| nanoim/package_audit.py | wheel/sdist artifact auditor |
| nanoim/release_bundle.py | checksummed release bundle and SBOM |
| nanoim/repro.py | sha256 manifest of canonical artifacts |
| SOURCE_MAP.md | reviewer map from claims to files |
| CLAIM_LEDGER.md | public claims, evidence, non-claims |
- The substrate is symbolic, not real audio/video.
- The pilot importer uses scripted local events with measured timing; it proves an ingestion path, not natural audio/video competence.
- The result shows representation value in a controlled lab, not broad conversational intelligence.
- The hard suite primarily tests state-stream aliasing; the noisy suite adds counterbalanced timing-only cases and is the timing-ablation evidence.
- The human-event protocol is ready, but no consented natural human-event corpus is included.
- Public claims should say "controlled symbolic evidence for a representation bottleneck," not "solved interaction modeling."
- The core release path is offline: no API keys, no hosted-model dependency. The only module that makes outbound HTTP calls is the optional nanoim.experiments.llm_baseline, which talks to a user-supplied OpenAI-compatible endpoint (Ollama at 127.0.0.1:11434 or LM Studio at 127.0.0.1:1234 by default) and is not invoked by any release gate.
- RESULTS.md - one-page summary of every headline number with its source artifact and reproduction command.
- FAQ.md - the questions a curious reader actually asks.
- CHANGELOG.md - what has shipped in each release.
- CONTRIBUTING.md - how to push back on the central claim or extend the evidence.
- SECURITY.md - threat model and reporting process.
- CODE_OF_CONDUCT.md - Contributor Covenant v2.1.
- .github/ISSUE_TEMPLATE/ - bug-report and claim-challenge templates.
@software{lovell_nanoim,
title = {nanoIM: A Tiny Interaction-Model Lab for Temporal Aliasing},
author = {Jason Lovell},
year = {2026},
version = {0.1.6},
doi = {10.5281/zenodo.20492362},
publisher = {Zenodo},
license = {MIT},
url = {https://github.com/jlov7/nanoIM}
}

See CITATION.cff for machine-readable citation metadata. Zenodo record: https://doi.org/10.5281/zenodo.20492362.
That is the whole artifact: a small, inspectable proof that interaction modeling needs representations that preserve time.
This repository is personal research and engineering work by Jason Lovell. It is not affiliated with, endorsed by, sponsored by, or representative of any current or former employer or organization. All views, code, claims, mistakes, and limitations are the author's own.