Frequently asked

Questions a curious reader actually asks before they decide whether to dig in. The harder formal objections are in HOSTILE_REVIEW.md and AUTHOR_RESPONSE.md. The claims-to-evidence map is CLAIM_LEDGER.md.

What does this prove, exactly?

That a flattened chat transcript is structurally insufficient to represent the decisions a real-time assistant has to make. If two interactions produce the same final transcript but require different next actions, no transcript-only policy can separate them. The repo gives a formal proof on balanced alias pairs, ten symbolic task families that produce such pairs, and an empirical demonstration that every method seeing the underlying streams clears the structural ceiling while every transcript-only method (including seven modern open-weights LLMs) sits at or below it.
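The ceiling argument fits in a few lines. The notation is an assumption made for illustration and is not taken from the paper: $T$ maps an interaction to its flattened transcript, $a^*(x)$ is the required next action, and $\pi$ is any transcript-only policy.

```latex
\begin{align*}
&\text{Alias pair: } x \neq x', \quad T(x) = T(x'), \quad a^*(x) \neq a^*(x'). \\
&\text{Transcript-only policy: } \hat{a}(x) = \pi(T(x)) = \pi(T(x')) = \hat{a}(x'). \\
&\text{Hence } \mathbf{1}[\hat{a}(x) = a^*(x)] + \mathbf{1}[\hat{a}(x') = a^*(x')] \le 1, \\
&\text{so accuracy over any set of balanced alias pairs is at most } \tfrac{1}{2}.
\end{align*}
```

A randomized policy satisfies the same bound in expectation, because the probabilities it assigns to two distinct actions cannot sum to more than one.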

Why a tiny from-scratch model? Why not just use an LLM?

The point isn't model capacity, it's representation. Seven modern LLMs (Phi-4 14B, Llama 3.2 3B, Qwen3 4B/14B/30B-A3B, Gemma 4 26B, GLM-4.7 Flash) were fed the same flattened transcript and asked for the next action. Every single one hit paired separation 0.000. The 19K-parameter from-scratch GRU that gets the streams reaches 1.000 / 1.000. Scaling the model doesn't fix a representation that has thrown the deciding information away.

Why symbolic data instead of real audio?

So the claim can actually be proven. With symbolic streams, alias pairs are constructed by hand: same transcript, different decisive variables. The structural argument is independent of whether the streams are symbolic or sampled from real audio, and the paper makes that explicit. A natural-audio companion is documented as future work (docs/human_event_corpus_protocol.md) and not yet exercised; the scripted local-event pilot demonstrates ingestion shape only.
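A hand-built alias pair might look like the sketch below. The `Event` type, the `flatten` helper, the utterance, and the action names are hypothetical and chosen only for illustration; only the `user_audio_state` field name comes from this repo's reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    t_ms: int    # event time in milliseconds (hypothetical unit)
    stream: str  # e.g. "text" or "user_audio_state"
    value: str


def flatten(events):
    """Keep only text events in time order, discarding timing and other streams."""
    ordered = sorted(events, key=lambda e: e.t_ms)
    return " ".join(e.value for e in ordered if e.stream == "text")


# Same words in both interactions; only the non-text stream differs.
still_speaking = [
    Event(0, "text", "book the flight"),
    Event(10, "user_audio_state", "speaking"),
]
finished = [
    Event(0, "text", "book the flight"),
    Event(10, "user_audio_state", "silent"),
]

# Hypothetical required next actions for the two interactions.
required_action = {"still_speaking": "yield", "finished": "respond"}

# The pair is an alias pair: identical transcripts, different required actions.
assert flatten(still_speaking) == flatten(finished)
assert required_action["still_speaking"] != required_action["finished"]
```

Anything that consumes only the output of `flatten` receives the same input for both interactions, which is the situation the formal argument addresses.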

Is this just a contrived setup?

The construction is deliberate, not contrived. The ten task families are the kinds of decisions a streaming voice assistant actually has to make: yield to the user, backchannel, stop speaking when interrupted, request approval before a tool call, resume after a background result. The structural argument is independent of the family list. The harder version is whether real-world transcripts have the same pathology at scale; nanoIM does not measure that yet. The LLM baselines only show that current open-weights text models do not recover information that this controlled transcript representation has discarded.

What's actually new?

A reproducible from-scratch demonstration that frames the chat interface itself as a representation bottleneck, with the negative result (the trained model doesn't generalize across families) shipped alongside the positive one. The related-work line in RELATED_WORK_MAP.md lists what the frontier (Moshi, Apple's Talking Turns, Full-Duplex-Bench, Thinking Machines' interaction-model post, Qwen2.5-Omni) is doing. nanoIM is narrower: a controlled symbolic existence proof rather than a real-time multimodal system.

Did you really train every model from scratch?

Yes. No pretrained weights, no LoRA adapters, no API distillation. The tiny GRU has 19K parameters on the mini suite, 65K on hard, and 115K on noisy; the tiny Transformer has 258K. Configs live in configs/, the training entry point is nanoim/train.py, and the vocab is built per suite from the training split. The reproducibility manifest hashes every checkpoint.
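Hashing checkpoints into a manifest can be sketched as follows. This is not the repo's manifest code: the function names, the `*.pt` pattern, and the path-to-digest layout are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(checkpoint_dir, pattern="*.pt"):
    """Map each checkpoint's relative path to its SHA-256 hex digest."""
    root = Path(checkpoint_dir)
    return {
        p.relative_to(root).as_posix(): sha256_file(p)
        for p in sorted(root.rglob(pattern))
    }
```

Re-running `build_manifest` on a fresh clone and comparing the result to the published manifest is one way a reader could confirm that the checkpoints are byte-identical.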

How long does this take to train?

Seconds per checkpoint on an Apple M4 Max with MPS. The hard suite full-model checkpoint trains in roughly six seconds for the configuration in configs/hard.yaml. The full multi-seed LOFO sweep across all ten held-out families and three seeds takes about six minutes for the GRU and a similar window for the Transformer.

Does the trained model generalize?

Across seeds within distribution: yes, very robustly (paired separation 1.000 ± 0.0000 across seeds 3, 7, and 11). Across held-out task families: no, and we report that openly. The leave-one-family-out sweep puts the trained GRU's mean TAA at 0.344 and the trained Transformer's at 0.256, against a transcript-majority baseline of 0.300: the GRU is only marginally above it and the Transformer is below it. The invariant rule harness is the baseline that transfers; the trained model overfits to in-family patterns. That's the kind of negative result that should ship with the positive one.
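The leave-one-family-out protocol can be sketched as below. The example records, family and action names, and the majority-class learner are hypothetical stand-ins; the repo's own sweep trains the GRU and Transformer in place of `fit_majority`.

```python
from collections import Counter


def lofo_splits(examples):
    """Yield (held_out_family, train, test), holding out one family at a time."""
    families = sorted({ex["family"] for ex in examples})
    for held_out in families:
        train = [ex for ex in examples if ex["family"] != held_out]
        test = [ex for ex in examples if ex["family"] == held_out]
        yield held_out, train, test


def fit_majority(train):
    """Toy learner: always predict the most common training action."""
    majority = Counter(ex["action"] for ex in train).most_common(1)[0][0]
    return lambda ex: majority


def lofo_mean_accuracy(examples, fit):
    """Average held-out accuracy over all families."""
    scores = []
    for _, train, test in lofo_splits(examples):
        predict = fit(train)
        scores.append(sum(predict(ex) == ex["action"] for ex in test) / len(test))
    return sum(scores) / len(scores)


# Hypothetical records: three families, two examples each.
examples = [
    {"family": "yield", "action": "wait"},
    {"family": "yield", "action": "speak"},
    {"family": "backchannel", "action": "ack"},
    {"family": "backchannel", "action": "wait"},
    {"family": "interrupt", "action": "stop"},
    {"family": "interrupt", "action": "speak"},
]
score = lofo_mean_accuracy(examples, fit_majority)
```

The key property is that the held-out family never appears in the training split, so a high score requires transferring a rule rather than memorizing in-family patterns.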

What field does the model actually rely on?

user_audio_state, by a clear margin. The inference-time occlusion analysis (same trained weights, fields blanked at evaluation) shows TAA changes of -0.350 for the GRU and -0.400 for the Transformer when user_audio_state is removed. Both architectures agree on the field-importance ranking: audio > visual > background > timing ≈ model_state ≈ policy. Source: reports/occlusion_analysis.json.
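Inference-time occlusion is simple to sketch: keep the predictor fixed, blank one field at a time, and record the change in accuracy. The code below assumes TAA can be treated as plain next-action accuracy and uses a toy rule in place of the trained model; the helper names and the blank token are hypothetical.

```python
def taa(predict, examples):
    """Fraction of examples whose predicted action matches the label."""
    return sum(predict(ex) == ex["action"] for ex in examples) / len(examples)


def occlude(example, field, blank="<blank>"):
    """Return a copy of the example with one field blanked."""
    return {**example, field: blank}


def occlusion_deltas(predict, examples, fields):
    """TAA change per field when that field is blanked at evaluation time."""
    base = taa(predict, examples)
    return {
        f: taa(predict, [occlude(ex, f) for ex in examples]) - base
        for f in fields
    }


# Toy stand-in for a trained model: it only looks at user_audio_state.
def toy_predict(ex):
    return "yield" if ex["user_audio_state"] == "speaking" else "respond"


examples = [
    {"user_audio_state": "speaking", "timing": "early", "action": "yield"},
    {"user_audio_state": "silent", "timing": "late", "action": "respond"},
]
deltas = occlusion_deltas(toy_predict, examples, ["user_audio_state", "timing"])
```

Because the weights are never retrained, a large negative delta indicates that the predictor actually reads that field, not merely that the field is informative in principle.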

Is the 0.50 → 1.00 gap actually statistically significant?

Yes. Paired permutation tests on per-example correctness across 10,000 resamples return p = 0.0000, which at this resolution means p < 1e-4, for every stream-aware method against the transcript-majority baseline. Bootstrap 95% confidence intervals: transcript majority [0.217, 0.383], stream-aware methods all [1.000, 1.000]. Source: reports/statistical_tests.json.
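Both tests can be sketched with the standard library. This is a generic sign-flip permutation test and a percentile bootstrap, not the repo's implementation; the function names, the seed handling, and the 120-example toy data are assumptions made for illustration.

```python
import random


def paired_permutation_p(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip test on paired per-example correctness (0 or 1)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, the two methods' labels are exchangeable per example.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    # Can be exactly 0.0; read that as p < 1 / n_resamples.
    return hits / n_resamples


def bootstrap_ci(correct, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for mean correctness."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical data: 120 examples, one method always right, the other right 30% of the time.
stream_aware = [1] * 120
transcript_only = [1] * 36 + [0] * 84
p_value = paired_permutation_p(stream_aware, transcript_only, n_resamples=2_000)
ci = bootstrap_ci(transcript_only, n_resamples=2_000)
```

A method that is correct on every example has a degenerate bootstrap interval, since every resample has mean 1.0, which is why the stream-aware intervals collapse to a single point.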

Why should I trust the destructive controls?

Because they're designed to break the model's edge and they do. Scrambling timing drops noisy TAA by 0.600. Mismatching critical streams drops it by 0.681. Permuting held-out labels drops it by 0.803. The release verifier asserts every one of these drops and fails if any falls below threshold. Source: reports/adversarial_controls.json (noisy GRU) and reports/transformer_hard_controls.json.
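A destructive control and the verifier's threshold check can be sketched as follows. The event layout, the function names, and the 0.5 threshold are hypothetical; only the three drop values are taken from the figures quoted above.

```python
import random


def scramble_timing(events, seed=0):
    """Shuffle timestamps across events while leaving each event's content intact."""
    rng = random.Random(seed)
    times = [e["t_ms"] for e in events]
    rng.shuffle(times)
    return [{**e, "t_ms": t} for e, t in zip(events, times)]


def failing_controls(drops, min_drop):
    """Return the names of controls whose TAA drop is smaller than required."""
    return sorted(name for name, drop in drops.items() if drop < min_drop)


# Drops quoted in this FAQ; min_drop is a hypothetical threshold.
reported_drops = {
    "scramble_timing": 0.600,
    "mismatch_critical_streams": 0.681,
    "permute_heldout_labels": 0.803,
}
failures = failing_controls(reported_drops, min_drop=0.5)
if failures:
    raise SystemExit(f"destructive controls too weak: {failures}")
```

The check is deliberately one-sided: a control that fails to hurt performance suggests the model's edge comes from somewhere other than the streams being destroyed.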

What's intentionally not here?

Four things: a consented natural human-event corpus; hosted CI on the published commit (pending the first public push); published Hugging Face URLs (pending the dataset and model repos going live); and independent external review. These gaps are stated openly in reports/failure_report.md and RESULTS.md. The work is small on purpose; the structural argument doesn't need scale, it needs to be inspectable.

Is this peer-reviewed?

No. It's a release of a reproducible artifact, not a publication. The paper-style writeup in paper/the_interaction_bottleneck.md is the formal version of the argument; the README and RESULTS.md are the operational summary. Counter-evidence and critique are explicitly welcomed; CONTRIBUTING.md lists the highest-value contributions in priority order.

How can I argue with it?

Two ways. Either find a leak (a transcript feature that encodes more than text, a destructive control that does not drop performance the way it should, a baseline that beats the rule harness on the LOFO sweep) and file a claim_challenge issue with a reproducer; or extend the evidence (a new task family, an audio companion, a stronger LLM baseline, a tighter formal argument) and open a pull request. Templates for both are in .github/.