Questions a curious reader actually asks before they decide whether to dig in. The harder formal objections are in HOSTILE_REVIEW.md and AUTHOR_RESPONSE.md. The claims-to-evidence map is CLAIM_LEDGER.md.
That a flattened chat transcript is structurally insufficient to represent the decisions a real-time assistant has to make. If two interactions produce the same final transcript but require different next actions, no transcript-only policy can separate them. The repo gives a formal proof on balanced alias pairs, ten symbolic task families that produce such pairs, and an empirical demonstration that every method seeing the underlying streams clears the structural ceiling while every transcript-only method (including seven modern open-weights LLMs) sits at or below it.
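The ceiling argument can be checked by brute force on a toy case. The sketch below is illustrative only: the three alias pairs, the action names, and the definition of paired separation used here (the fraction of pairs where both members are answered correctly) are assumptions for the example, not the repo's data or metric code. It enumerates every deterministic transcript-only policy and records the best score any of them can reach.

```python
from itertools import product

# Hypothetical balanced alias pairs: each pair shares one flattened transcript,
# but its two members require different next actions.
ALIAS_PAIRS = [
    ("user: book it", "CALL_TOOL", "REQUEST_APPROVAL"),
    ("user: so anyway", "YIELD", "KEEP_SPEAKING"),
    ("assistant: the result is", "STOP", "RESUME"),
]
ACTIONS = sorted({a for _, a, _ in ALIAS_PAIRS} | {b for _, _, b in ALIAS_PAIRS})


def score(policy):
    """Return (accuracy, paired_separation) for a transcript -> action mapping."""
    correct = 0
    separated = 0
    for transcript, action_a, action_b in ALIAS_PAIRS:
        # Both members present the same transcript, so they get the same prediction.
        prediction = policy[transcript]
        hits = (prediction == action_a) + (prediction == action_b)
        correct += hits
        # Assumed definition: a pair is separated only if both members are correct.
        separated += hits == 2
    n_pairs = len(ALIAS_PAIRS)
    return correct / (2 * n_pairs), separated / n_pairs


# Enumerate every deterministic transcript-only policy over this toy set.
transcripts = [t for t, _, _ in ALIAS_PAIRS]
results = [
    score(dict(zip(transcripts, choice)))
    for choice in product(ACTIONS, repeat=len(transcripts))
]
best_accuracy = max(acc for acc, _ in results)
best_separation = max(sep for _, sep in results)
print(best_accuracy, best_separation)  # 0.5 0.0
```

No policy in the enumeration exceeds 0.5 accuracy or separates a single pair, because one input can only map to one output while the pair demands two.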
The point isn't model capacity, it's representation. Seven modern LLMs (Phi-4 14B, Llama 3.2 3B, Qwen3 4B/14B/30B-A3B, Gemma 4 26B, GLM-4.7 Flash) were fed the same flattened transcript and asked for the next action. Every single one hit paired separation 0.000. The 19K-parameter from-scratch GRU that gets the streams reaches 1.000 / 1.000. Scaling the model doesn't fix a representation that has thrown the deciding information away.
So the claim can actually be proven. With symbolic streams, alias pairs are constructed by hand: same transcript, different decisive variables. The structural argument is independent of whether the streams are symbolic or sampled from real audio, and the paper makes that explicit. A natural-audio companion is documented as future work (docs/human_event_corpus_protocol.md) and not yet exercised; the scripted local-event pilot demonstrates ingestion shape only.
The construction is deliberate, not contrived. The ten task families are the kinds of decisions a streaming voice assistant actually has to make: yield to the user, backchannel, stop speaking when interrupted, request approval before a tool call, resume after a background result. The structural argument is independent of the family list. The harder version is whether real-world transcripts have the same pathology at scale; nanoIM does not measure that yet. The LLM baselines only show that current open-weights text models do not recover information that this controlled transcript representation has discarded.
A reproducible from-scratch demonstration that frames the chat interface itself as a representation bottleneck, with the negative result (the trained model doesn't generalize across families) shipped alongside the positive one. The related-work line in RELATED_WORK_MAP.md lists what the frontier (Moshi, Apple's Talking Turns, Full-Duplex-Bench, Thinking Machines' interaction-model post, Qwen2.5-Omni) is doing. nanoIM is narrower: a controlled symbolic existence proof rather than a real-time multimodal system.
Yes. No pretrained weights, no LoRA adapters, no API distillation. The tiny GRU has 19K parameters on the mini suite, 65K on hard, and 115K on noisy. The tiny Transformer has 258K parameters. Configs are in configs/, the training entry point is nanoim/train.py, and the vocab is built per suite from the training split. The reproducibility manifest hashes every checkpoint.
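For readers unfamiliar with checkpoint manifests, here is a minimal sketch of the idea, assuming SHA-256 digests and a `*.pt` file pattern. The function names, the file name, and the manifest layout are hypothetical; the repo's actual manifest format may differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(checkpoint_dir: Path, pattern: str = "*.pt") -> dict:
    """Map each checkpoint file name to its SHA-256 digest (assumed layout)."""
    return {p.name: sha256_of(p) for p in sorted(checkpoint_dir.glob(pattern))}


# Demonstration on a throwaway directory with a stand-in checkpoint file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "gru_seed3.pt").write_bytes(b"fake checkpoint bytes")
    manifest = build_manifest(root)
    print(json.dumps(manifest, indent=2))
```

Anyone who retrains can rebuild the manifest and compare digests; a mismatch means the checkpoint bytes differ from the released ones.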
Seconds per checkpoint on an Apple M4 Max with MPS. The hard suite full-model checkpoint trains in roughly six seconds for the configuration in configs/hard.yaml. The full multi-seed LOFO sweep across all ten held-out families and three seeds takes about six minutes for the GRU and a similar window for the Transformer.
Across seeds within distribution: yes, very robustly (paired separation 1.000 ± 0.0000 across seeds 3, 7, and 11). Across held-out task families: no, and we report that openly. The leave-one-family-out (LOFO) sweep puts the trained GRU's mean TAA at 0.344 and the trained Transformer's at 0.256, against a transcript-majority baseline of 0.300: the Transformer falls below the baseline and the GRU clears it only marginally. The invariant rule harness is the baseline that transfers; the trained model overfits to in-family patterns. That's the kind of negative result that should ship with the positive one.
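A leave-one-family-out split is simple to state in code. The sketch below is a toy illustration, not the repo's sweep: the family names, action labels, and the majority-action stand-in for a trained model are all hypothetical.

```python
from collections import Counter

# Hypothetical examples tagged with a task family and a target action.
EXAMPLES = [
    {"family": "yield", "action": "YIELD"},
    {"family": "yield", "action": "WAIT"},
    {"family": "backchannel", "action": "BACKCHANNEL"},
    {"family": "backchannel", "action": "WAIT"},
    {"family": "interrupt", "action": "STOP"},
    {"family": "interrupt", "action": "WAIT"},
]


def leave_one_family_out(examples):
    """Yield (held_out_family, train, test) with no family shared across the split."""
    for held_out in sorted({ex["family"] for ex in examples}):
        train = [ex for ex in examples if ex["family"] != held_out]
        test = [ex for ex in examples if ex["family"] == held_out]
        yield held_out, train, test


def majority_action(train):
    """Stand-in for a trained model: always predict the most common training action."""
    return Counter(ex["action"] for ex in train).most_common(1)[0][0]


scores = {}
for held_out, train, test in leave_one_family_out(EXAMPLES):
    guess = majority_action(train)
    scores[held_out] = sum(ex["action"] == guess for ex in test) / len(test)

mean_score = sum(scores.values()) / len(scores)
print(scores, mean_score)
```

The property that matters is that the held-out family never appears in training, so the score measures transfer to an unseen decision type rather than in-family fit.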
user_audio_state, by a clear margin. The inference-time occlusion analysis (same trained weights, fields blanked at evaluation) shows TAA falling by 0.350 for the GRU and by 0.400 for the Transformer when user_audio_state is removed. Both architectures agree on the field-importance ranking: audio > visual > background > timing ≈ model_state ≈ policy. Source: reports/occlusion_analysis.json.
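Inference-time occlusion can be sketched in a few lines: keep the policy fixed, blank one field at a time, and record the change in accuracy. Everything below is a toy stand-in under stated assumptions. The hand-written `toy_policy` replaces the trained model, the `background` field name and its values are hypothetical, and only `user_audio_state` is a field name taken from the text above.

```python
BLANK = "<blank>"  # assumed sentinel for an occluded field

# Hypothetical evaluation examples: stream fields plus the correct action.
EXAMPLES = [
    {"streams": {"user_audio_state": "speaking", "background": "idle"}, "action": "YIELD"},
    {"streams": {"user_audio_state": "speaking", "background": "result_ready"}, "action": "YIELD"},
    {"streams": {"user_audio_state": "silent", "background": "result_ready"}, "action": "RESUME"},
    {"streams": {"user_audio_state": "silent", "background": "idle"}, "action": "CONTINUE"},
]


def toy_policy(streams):
    """Fixed rule standing in for frozen trained weights."""
    if streams["user_audio_state"] == "speaking":
        return "YIELD"
    if streams["background"] == "result_ready":
        return "RESUME"
    return "CONTINUE"


def occlude(example, field):
    """Return a copy of the example with one stream field blanked."""
    return {**example, "streams": {**example["streams"], field: BLANK}}


def accuracy(policy, examples):
    return sum(policy(ex["streams"]) == ex["action"] for ex in examples) / len(examples)


def occlusion_deltas(policy, examples, fields):
    """Accuracy change per blanked field; more negative means more important."""
    base = accuracy(policy, examples)
    return {
        field: accuracy(policy, [occlude(ex, field) for ex in examples]) - base
        for field in fields
    }


deltas = occlusion_deltas(toy_policy, EXAMPLES, ["user_audio_state", "background"])
print(deltas)  # {'user_audio_state': -0.5, 'background': -0.25}
```

Because the weights never change, any drop is attributable to the missing field rather than to retraining effects.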
Yes. Paired permutation tests on per-example correctness, with 10,000 resamples each, return p = 0.0000 for every stream-aware method against the transcript-majority baseline. No resample was as extreme as the observed difference, so the result is best read as p < 1e-4 rather than as an exact zero. Bootstrap 95% confidence intervals: transcript majority [0.217, 0.383], stream-aware methods all [1.000, 1.000]. Source: reports/statistical_tests.json.
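For readers who want to see the shape of these two tests, here is a minimal standard-library sketch. It assumes a two-sided sign-flip permutation test and a percentile bootstrap, which may not match the repo's exact procedure, and the per-example correctness vectors are made-up toy data (120 examples, 30% baseline accuracy), not the released results.

```python
import random


def paired_permutation_p(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip test on per-example correctness differences."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to have either sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            extreme += 1
    return extreme / n_resamples


def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean accuracy."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[round(alpha / 2 * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: a method that is always right versus a baseline right 30% of the time.
stream_aware = [1] * 120
baseline = [1] * 36 + [0] * 84

p_value = paired_permutation_p(stream_aware, baseline)
ci_stream = bootstrap_ci(stream_aware)
ci_baseline = bootstrap_ci(baseline)
print(p_value, ci_stream, ci_baseline)
```

With 10,000 resamples the smallest nonzero p-value this function can return is 1e-4, which is why a reported 0.0000 is a bound, not a measurement.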
Because they're designed to destroy the model's advantage, and they do. Scrambling timing drops TAA on the noisy suite by 0.600. Mismatching critical streams drops it by 0.681. Permuting held-out labels drops it by 0.803. The release verifier asserts every one of these drops and fails if any falls below its threshold. Source: reports/adversarial_controls.json (noisy GRU) and reports/transformer_hard_controls.json.
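A verifier of this kind amounts to a threshold check that fails loudly. The sketch below uses the three drops quoted above, but the key names, the 0.300 thresholds, and the function itself are hypothetical; the repo's verifier and its real thresholds may look different.

```python
# Drops quoted in the text; key names are assumed for this sketch.
OBSERVED_DROPS = {
    "scramble_timing": 0.600,
    "mismatch_critical_streams": 0.681,
    "permute_heldout_labels": 0.803,
}

# Hypothetical minimum drop each destructive control must cause.
THRESHOLDS = {
    "scramble_timing": 0.300,
    "mismatch_critical_streams": 0.300,
    "permute_heldout_labels": 0.300,
}


def verify_controls(drops, thresholds):
    """Raise AssertionError if any control is missing or drops less than required."""
    failures = []
    for name, required in thresholds.items():
        observed = drops.get(name)
        if observed is None:
            failures.append(f"{name}: no result recorded")
        elif observed < required:
            failures.append(f"{name}: drop {observed:.3f} < required {required:.3f}")
    if failures:
        raise AssertionError("destructive controls too weak: " + "; ".join(failures))


verify_controls(OBSERVED_DROPS, THRESHOLDS)  # passes silently
print("all destructive controls cleared their thresholds")
```

Treating a missing control as a failure, rather than skipping it, keeps a deleted or renamed report entry from silently passing the release check.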
Four things: a consented natural human-event corpus; hosted CI on the published commit, which is pending until the first public push lands; published Hugging Face URLs, which are pending until the dataset and model repos go live; and independent external review. These gaps are stated openly in reports/failure_report.md and RESULTS.md. The work is small on purpose; the structural argument doesn't need scale, it needs to be inspectable.
No. It's a release of a reproducible artifact, not a publication. The paper-style writeup in paper/the_interaction_bottleneck.md is the formal version of the argument; the README and RESULTS.md are the operational summary. Counter-evidence and critique are explicitly welcomed; CONTRIBUTING.md lists the highest-value contributions in priority order.
Two ways. Either find a leak (a transcript feature that encodes more than text, a destructive control that does not drop performance the way it should, a baseline that beats the rule harness on the LOFO sweep) and file a claim_challenge issue with a reproducer; or extend the evidence (a new task family, an audio companion, a stronger LLM baseline, a tighter formal argument) and open a pull request. Templates for both are in .github/.