nanoIM

nanoIM temporal aliasing lab: same transcript, different timing, different actions

Chat collapses time. nanoIM restores it.
A small from-scratch interaction-model lab for the temporal-aliasing bottleneck.

Controlled symbolic evidence for a representation bottleneck. Trains from scratch, no API keys, fully offline.

License: MIT · Python 3.11 | 3.12 · PyTorch 2.12.0 · Managed by uv · 253 tests passing · pyright clean on nanoim package · verify_release pass · 155 hashed artifacts · CI · DOI: 10.5281/zenodo.20492362

An animated illustration of one alias pair. Two micro-turn timelines side by side. Both produce the identical flattened transcript word by word. At the critical timestep the user_audio_state differs (Trace A hesitating, Trace B silent), and so the required next action differs (WAIT vs SPEAK).


What This Is

nanoIM is a compact research artifact for one question:

If two interactions flatten to the same final chat transcript but require different next actions, what can a transcript-only model know?

The answer is structural. A transcript-only model cannot separate the pair because its input is identical. A native micro-turn model can, if the missing variables are preserved: silence, timing, overlap, interruptions, visual cues, policy boundaries, and asynchronous tool events.

The repository is intentionally small enough to read like nanoGPT, but complete enough to generate data, train models from scratch, evaluate baselines, render proof galleries, and audit leakage.

| Layer | What is included | Why it matters |
|---|---|---|
| Formal spine | paper/temporal_aliasing.md, paper/the_interaction_bottleneck.md | States the impossibility argument and experimental method. |
| Data lab | symbolic 200 ms-style micro-turn traces across 10 task families | Tests interaction states that disappear in chat transcripts. |
| Fair baselines | transcript majority, transcript naive Bayes, transcript oracle upper bound, rule harness, field-lookup table | Separates structural transcript failure from weak modeling. |
| Trainable models | tiny GRU and tiny Transformer, stream/time embeddings, action head | Proves the effect with from-scratch local models. |
| Anti-cheat gates | alias equality tests, no transcript leakage, destructive controls | Makes shortcut explanations harder. |
| Reviewer proof | machine-readable scorecards, timeline gallery, honest limitations | Gives skeptics direct artifacts to inspect. |

The Thesis

Turn-based chat is a lossy serialization format. It turns interaction into alternating messages and erases the variables that often decide whether a system should wait, speak, yield, interrupt, request approval, resume after a tool result, or integrate a background event.

nanoIM studies that bottleneck as temporal aliasing:

flatten(trace_a) == flatten(trace_b)
target_action(trace_a) != target_action(trace_b)

Any policy that only receives flatten(trace) must assign the same action distribution to both traces. If the required actions differ, it must fail at least one member of the pair. That is the core claim, and the repo turns it into executable tests.
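A minimal sketch of that argument in code. The `MicroTurn` fields, `flatten`, and `transcript_policy` here are illustrative stand-ins, not the repo's actual data model or API; the point is only that identical input forces identical output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroTurn:
    t_ms: int              # start time of the micro-turn (hypothetical field layout)
    text_delta: str        # words that would survive into the chat transcript
    user_audio_state: str  # e.g. "speaking", "hesitating", "silent"


def flatten(trace):
    """Collapse a trace to its transcript: keep the words, drop time and state."""
    return " ".join(turn.text_delta for turn in trace if turn.text_delta)


def is_alias_pair(trace_a, action_a, trace_b, action_b):
    """Same flattened transcript, different required next action."""
    return flatten(trace_a) == flatten(trace_b) and action_a != action_b


def transcript_policy(transcript):
    """Stand-in for any deterministic policy that sees only the transcript."""
    return "SPEAK" if transcript else "WAIT"


trace_a = [MicroTurn(0, "book the", "speaking"),
           MicroTurn(200, "3 pm slot", "speaking"),
           MicroTurn(400, "", "hesitating")]
trace_b = [MicroTurn(0, "book the", "speaking"),
           MicroTurn(200, "3 pm slot", "speaking"),
           MicroTurn(400, "", "silent")]

assert is_alias_pair(trace_a, "WAIT", trace_b, "SPEAK")
# Identical input forces identical output, so at most one member can be right.
predictions = [transcript_policy(flatten(t)) for t in (trace_a, trace_b)]
assert predictions[0] == predictions[1]
```

Swapping in any other body for `transcript_policy` leaves the last assertion true, which is the structural part of the claim.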

One-Minute Result

Held-out test split, hard suite:

| System | Temporal aliasing accuracy | Paired separation | Notes |
|---|---|---|---|
| Transcript majority | 0.300 | 0.000 | transcript only |
| Transcript naive Bayes | 0.467 | 0.000 | transcript only |
| Transcript oracle upper bound | 0.500 | 0.000 | structural ceiling for any transcript-only policy |
| Rule harness (stream-aware) | 1.000 | 1.000 | hand-written timing/stream rules |
| Field-lookup table (stream-aware) | 1.000 | 1.000 | memorized dict over stream fields |
| MicroTurn Tiny GRU | 1.000 +/- 0.0000 | 1.000 +/- 0.0000 | from scratch, seeds 3/7/11 |
| MicroTurn Tiny Transformer | 1.000 | 1.000 | from scratch, seed 7 |

Held-out test split, noisy counterbalanced suite:

| System | Temporal aliasing accuracy | Paired separation | Notes |
|---|---|---|---|
| Transcript majority | 0.400 | 0.000 | transcript only |
| Transcript naive Bayes | 0.4625 | 0.000 | transcript only |
| Transcript oracle upper bound | 0.500 | 0.000 | structural ceiling for any transcript-only policy |
| Rule harness (stream-aware) | 1.000 | 1.000 | hand-written timing/stream rules |
| Field-lookup table (stream-aware) | 1.000 | 1.000 | memorized dict over stream fields |
| MicroTurn Tiny GRU | 1.000 +/- 0.0000 | 1.000 +/- 0.0000 | from scratch, seeds 3/7/11 |
| No-timing ablation (GRU) | 0.650 +/- 0.0000 | 0.300 +/- 0.0000 | timing removed from the input |

Bootstrap 95% confidence intervals on per-example correctness from the hard-suite test split (10,000 resamples, seed 7):

| Method | TAA point estimate | TAA 95% CI |
|---|---|---|
| Transcript majority | 0.300 | [0.217, 0.383] |
| Tiny GRU (seed 7) | 1.000 | [1.000, 1.000] |
| Tiny Transformer (seed 7) | 1.000 | [1.000, 1.000] |
| Rule harness | 1.000 | [1.000, 1.000] |

Paired permutation tests (10,000 resamples, seed 7) for each stream-aware method against transcript majority all return p = 0.0000, consistent with p < 1e-4. Reproduce with uv run python -m nanoim.experiments.statistical_tests; full output in reports/statistical_tests.json.
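For readers who want the mechanics, here is a minimal standard-library sketch of a percentile bootstrap and a sign-flip paired permutation test on per-example correctness. It is not the code in nanoim.experiments.statistical_tests; the function names, the percentile indexing, and the `hits / resamples` p-value convention are assumptions chosen to mirror the description above.

```python
import random


def bootstrap_ci(correct, resamples=10_000, seed=7, alpha=0.05):
    """Percentile bootstrap CI for mean accuracy over 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(resamples))
    lower = means[int((alpha / 2) * resamples)]
    upper = means[int((1 - alpha / 2) * resamples) - 1]
    return lower, upper


def paired_permutation_p(correct_a, correct_b, resamples=10_000, seed=7):
    """Two-sided sign-flip test on the per-example difference a - b."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(resamples):
        # Under the null, each paired difference is equally likely to flip sign.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return hits / resamples


# Synthetic flags shaped like the table: 120 examples, 0.300 vs 1.000 accuracy.
stream_aware = [1] * 120
transcript_majority = [1] * 36 + [0] * 84
print(bootstrap_ci(transcript_majority, resamples=2000))
print(paired_permutation_p(stream_aware, transcript_majority, resamples=2000))
```

With this convention a result of 0.0 means no resample reached the observed gap, which is why a reported p = 0.0000 should be read as p < 1/resamples rather than as exactly zero.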

Read this as a statement about the representation, not the model. Transcript-only policies are provably capped at 0.500 on balanced alias pairs: the oracle can assign only one action per identical transcript, so it must miss one member of every pair (paired separation 0.000). Every stream-aware method I tried (rules, lookup table, GRU, Transformer) saturates at 1.000. nanoIM does not claim the neural model beats symbolic baselines; in this controlled lab it ties them, which is the point. The trained model is an existence proof that the signal is learnable from scratch. The contribution is the 0.500 → 1.000 gap from restoring the streams (state and timing) the transcript discards.
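The 0.500 cap can be checked by brute force. The sketch below (an illustrative helper, not the repo's oracle implementation) gives a transcript-only policy the best possible action for each distinct transcript and measures what it scores on balanced alias pairs.

```python
from collections import Counter, defaultdict


def transcript_oracle_accuracy(examples):
    """Best accuracy of any policy that maps each transcript to one action.

    `examples` is a list of (transcript, target_action) tuples.
    """
    by_transcript = defaultdict(Counter)
    for transcript, action in examples:
        by_transcript[transcript][action] += 1
    # The oracle answers with the most frequent target for each transcript.
    correct = sum(max(counts.values()) for counts in by_transcript.values())
    return correct / len(examples)


balanced_pairs = [
    ("book the 3 pm slot", "WAIT"),
    ("book the 3 pm slot", "REQUEST_APPROVAL"),
    ("cancel my order", "WAIT"),
    ("cancel my order", "SPEAK"),
]
assert transcript_oracle_accuracy(balanced_pairs) == 0.5
```

Each transcript appears once with each of two targets, so even the best per-transcript choice is right on exactly one member of every pair.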

What "held-out test split" does and does not mean. Train/val/test use disjoint transcript wordings, so a transcript-only model cannot win by memorizing text templates. But the streams that decide the action are identical across splits; only the filler wording differs. The test numbers therefore measure within-distribution reproduction plus robustness to decoy noise, not generalization to novel interaction structure or unseen task families. nanoIM makes no generalization claim for the learned model beyond this controlled setup.

The field-lookup table is deliberately included as a strong baseline, not a strawman (reports/hard_field_lookup_scorecard.json, reports/noisy_field_lookup_scorecard.json; reproduce with uv run python -m nanoim.eval --baseline field_lookup ...). That a plain dict saturates the task once it sees the streams is the point: the bottleneck is the transcript representation, not model capacity.
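A minimal sketch of what such a baseline could look like, assuming a plain dict keyed on a tuple of stream fields with a global-majority fallback. The class name, the key layout, and the `"none"` placeholder are hypothetical; the committed baseline in nanoim/baselines.py may differ.

```python
from collections import Counter, defaultdict

# Stream field names as they appear in the occlusion table; the key layout is an assumption.
STREAM_FIELDS = ("user_audio_state", "visual_event", "background_event", "policy_event")


class FieldLookupTable:
    """Memorize the majority action per stream-field tuple; no learning beyond counting."""

    def fit(self, rows):
        votes = defaultdict(Counter)
        overall = Counter()
        for fields, action in rows:
            votes[self._key(fields)][action] += 1
            overall[action] += 1
        self.table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
        self.default = overall.most_common(1)[0][0]  # used for keys never seen in training
        return self

    def predict(self, fields):
        return self.table.get(self._key(fields), self.default)

    @staticmethod
    def _key(fields):
        return tuple(fields.get(name, "none") for name in STREAM_FIELDS)


table = FieldLookupTable().fit([
    ({"user_audio_state": "hesitating"}, "WAIT"),
    ({"user_audio_state": "hesitating"}, "WAIT"),
    ({"user_audio_state": "silent"}, "SPEAK"),
])
assert table.predict({"user_audio_state": "silent"}) == "SPEAK"
assert table.predict({"user_audio_state": "interrupting"}) == "WAIT"  # unseen key, default
```

The second assertion shows the failure mode a table like this has on unseen stream keys: it can only fall back to its global default.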

Scatter plot of temporal aliasing accuracy vs paired separation rate. Transcript-only methods cluster at TAA <= 0.500 with paired separation 0.000. Stream-aware methods (rules, lookup table, GRU, Transformer) cluster at 1.000 / 1.000.

Transcript-only LLM baselines (the structural ceiling held by modern open-weights models)

The structural argument predicts that any deterministic transcript-only policy is capped at TAA 0.500 on balanced alias pairs with paired separation 0.000 under this construction, regardless of capacity. I tested that against seven modern open-weights LLMs spanning five families and 3B–30B parameters, each given the same flattened transcript and asked for the next action. The parser uses max_tokens=2048 and falls back to the reasoning field for models that emit chain-of-thought (Qwen3 and GLM-4.7 do); every prediction is sourced from a real model output, not from a default. Zero parse failures across 840 LLM calls.

| Model | Family | Params | Type | TAA | Paired separation | n |
|---|---|---|---|---|---|---|
| Phi-4 14B | Microsoft | 14B | dense | 0.133 | 0.000 | 120 |
| Llama 3.2 3B | Meta | 3B | dense | 0.208 | 0.000 | 120 |
| Qwen3 4B | Alibaba | 4B | dense | 0.133 | 0.000 | 120 |
| Qwen3 14B MLX | Alibaba | 14B | dense | 0.158 | 0.000 | 120 |
| Qwen3 30B-A3B MLX | Alibaba | 30B (3B active) | MoE | 0.183 | 0.000 | 120 |
| Gemma 4 26B | Google | 26B | dense | 0.208 | 0.000 | 120 |
| GLM-4.7 Flash | Zhipu AI | 30B class | reasoning (2026) | 0.117 | 0.000 | 120 |

420 alias pairs evaluated, 420/420 with paired separation = 0.000. Every tested model, across family, parameter count, dense/MoE/reasoning architecture, and chain-of-thought behavior, assigns the same action to both members of every alias pair. The 19K-parameter from-scratch GRU that sees the streams reaches 1.000 / 1.000 on the same data. In this benchmark, the bottleneck is the representation, not the model. Per-model JSON in reports/llm_baseline_*.json; aggregate in reports/llm_baseline_summary.json. Reproduce with uv run python -m nanoim.experiments.llm_baseline --base-url http://127.0.0.1:11434/v1 --model qwen3:4b --out reports/llm_baseline_qwen3_4b.json (Ollama) or substitute the LM Studio endpoint and model name.
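The fallback behavior described above can be sketched as follows. This is not the parser in nanoim.experiments.llm_baseline: the message shape (a dict with `content` and `reasoning` keys), the rule of taking the last action token mentioned, and the function name are all assumptions. The action vocabulary is taken from the task-family table.

```python
import re

# Action names as listed in the task-family table.
ACTIONS = ("STOP_SPEAKING", "ASK_CLARIFICATION", "REQUEST_APPROVAL",
           "RESUME_WITH_RESULT", "BACKCHANNEL", "CALL_TOOL", "SPEAK", "WAIT")
ACTION_RE = re.compile(r"\b(" + "|".join(ACTIONS) + r")\b")


def parse_action(message):
    """Return the last action named in `content`, else in `reasoning`, else None.

    Returning None (a parse failure) instead of a default action keeps every
    counted prediction tied to real model output.
    """
    for field in ("content", "reasoning"):
        matches = ACTION_RE.findall(message.get(field) or "")
        if matches:
            return matches[-1]
    return None


assert parse_action({"content": "Next action: WAIT"}) == "WAIT"
assert parse_action({"content": "", "reasoning": "Maybe WAIT... no, SPEAK"}) == "SPEAK"
assert parse_action({"content": None, "reasoning": "unclear"}) is None
```

Because `_` counts as a word character, the `\b` anchors keep `SPEAK` from matching inside `STOP_SPEAKING`.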

Input-field importance (occlusion at inference time)

Same trained checkpoints, no retraining; each input field is blanked at inference time. The drop in temporal aliasing accuracy measures how much each architecture actually consults each field at decision time. Both architectures agree on the ranking:

| Field blanked | GRU TAA drop | Transformer TAA drop |
|---|---|---|
| user_audio_state | -0.350 | -0.400 |
| visual_event | -0.183 | -0.200 |
| background_event | -0.100 | -0.100 |
| t_ms / dt_ms | -0.050 | -0.050 |
| model_audio_state | -0.050 | -0.050 |
| policy_event | -0.050 | -0.050 |

user_audio_state is the single most load-bearing field. That fits the data: most task families turn on whether the user is speaking, silent, hesitating, or interrupting at the critical timestep. This is different from the retraining no_X ablations (where the model can adapt around a missing field): the question here is what the trained model relies on at inference. Reproduce with uv run python -m nanoim.experiments.occlusion --out reports/occlusion_analysis.json. Full per-field, per-architecture detail in reports/occlusion_analysis.json.
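As a rough illustration of occlusion at inference time, the sketch below blanks one field on every example and reports the change in accuracy for a fixed policy. `toy_policy`, the `"none"` blank value, and the example rows are hypothetical; the real experiment runs trained checkpoints, not a hand-written rule.

```python
def accuracy(policy, examples):
    """Fraction of (fields, target_action) examples the policy gets right."""
    return sum(policy(fields) == action for fields, action in examples) / len(examples)


def occlusion_drop(policy, examples, field, blank="none"):
    """Accuracy change when `field` is blanked at inference time (no retraining)."""
    blanked = [({**fields, field: blank}, action) for fields, action in examples]
    return accuracy(policy, blanked) - accuracy(policy, examples)


def toy_policy(fields):
    """Hand-written stand-in for a trained stream-aware model."""
    if fields["policy_event"] == "approval_required":
        return "REQUEST_APPROVAL"
    return "WAIT" if fields["user_audio_state"] == "hesitating" else "SPEAK"


examples = [
    ({"user_audio_state": "hesitating", "policy_event": "none"}, "WAIT"),
    ({"user_audio_state": "silent", "policy_event": "none"}, "SPEAK"),
    ({"user_audio_state": "hesitating", "policy_event": "none"}, "WAIT"),
    ({"user_audio_state": "silent", "policy_event": "approval_required"}, "REQUEST_APPROVAL"),
]
assert occlusion_drop(toy_policy, examples, "user_audio_state") == -0.5
assert occlusion_drop(toy_policy, examples, "policy_event") == -0.25
```

The sign convention matches the table: a more negative drop means the policy leans harder on that field.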

Generalization (leave-one-family-out)

I separately tested whether each method generalizes to interaction structures it did not see during training. For each of the ten task families, I refit each method on the other nine families (retraining the trainable models from scratch) and evaluated on the held-out family. The results show which methods generalize and which do not:

Hard suite, mean across 10 held-out families × 3 seeds:

| Method | Mean LOFO TAA | Mean LOFO paired separation |
|---|---|---|
| Rule harness (invariant) | 1.000 | 1.000 |
| Tiny GRU | 0.344 | 0.156 |
| Tiny Transformer | 0.256 | 0.011 |
| Field-lookup table | 0.300 | 0.000 |
| Transcript majority | 0.300 | n/a |

Noisy suite, mean across 10 held-out families × 3 seeds:

| Method | Mean LOFO TAA | Mean LOFO paired separation |
|---|---|---|
| Rule harness (invariant) | 1.000 | 1.000 |
| Tiny GRU | 0.429 | 0.075 |
| Tiny Transformer | 0.317 | 0.033 |
| Field-lookup table | 0.400 | 0.000 |
| Transcript majority | 0.400 | n/a |

Per-family LOFO TAA across rule harness, tiny GRU, tiny Transformer, field-lookup, and transcript majority on the hard suite. The rule harness stays at 1.000 on every family; the trained models collapse to a wide range below the 0.50 oracle bound.

The rule harness is invariant by construction (its fit is a no-op), so it transfers trivially on every held-out family and the release verifier asserts this. The field-lookup table collapses to its global default action on stream keys it never saw. Both trained architectures stay below the 0.500 transcript oracle bound on both suites; the Transformer also falls below the transcript-majority baseline (0.256 vs 0.300 hard, 0.317 vs 0.400 noisy), while the GRU only edges above it (0.344 and 0.429). The more expressive Transformer collapses further than the GRU on both suites, which is consistent with more capacity producing more family-specific overfitting rather than more generalization. Source: reports/lofo_sweep.json and reports/lofo_sweep_noisy.json.

This is the kind of negative result that should ship with the positive one. The repo's claim is about the representation and the invariant rule harness, not about novel-structure generalization from the learned model. Reproduce with uv run python -m nanoim.experiments.lofo --out reports/lofo_sweep.json. Per-family numbers are in reports/lofo_sweep.json.
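The split logic behind a leave-one-family-out sweep is small enough to sketch. The `family` key and the generator name are assumptions about the example schema, not the interface of nanoim.experiments.lofo.

```python
def leave_one_family_out(examples):
    """Yield (held_out_family, train, test) with no family shared across the two."""
    families = sorted({example["family"] for example in examples})
    for held_out in families:
        train = [ex for ex in examples if ex["family"] != held_out]
        test = [ex for ex in examples if ex["family"] == held_out]
        yield held_out, train, test


examples = [
    {"family": "yield detection", "target_action": "WAIT"},
    {"family": "yield detection", "target_action": "SPEAK"},
    {"family": "hesitation", "target_action": "BACKCHANNEL"},
    {"family": "barge-in recovery", "target_action": "STOP_SPEAKING"},
]
for held_out, train, test in leave_one_family_out(examples):
    # A fresh model would be fit on `train` here and scored on `test`.
    assert all(ex["family"] != held_out for ex in train)
    assert all(ex["family"] == held_out for ex in test)
```

Each fold needs its own freshly fit model; reusing a checkpoint that saw the held-out family would turn the sweep back into a within-distribution test.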

Proof Spine

flowchart LR
    A["Symbolic micro-turn generator"] --> B["Paired alias examples"]
    B --> C["Anti-leakage tests"]
    C --> D["Transcript baselines (oracle bound 0.50)"]
    C --> E["Rule harness / field-lookup"]
    C --> F["MicroTurn Tiny GRU / Transformer"]
    F --> G["Ablations and destructive controls"]
    D --> H["Machine-readable scorecards"]
    E --> H
    G --> H
    H --> I["Timeline gallery"]

Quickstart

Requires Python 3.11 or 3.12. Install uv first, then:

uv sync --dev
uv run python -m nanoim --help
uv run python -m nanoim.data.generate --suite mini --out data/mini.jsonl
uv run python -m nanoim.train --config configs/tiny.yaml
uv run python -m nanoim.eval --checkpoint runs/tiny/best.pt --suite mini --out reports/scorecard.json
uv run python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline.html
uv run pytest
uv build
uv run python -m nanoim.package_audit --out reports/package_audit.json

After activating .venv, the spec's bare commands work as written:

source .venv/bin/activate
python -m nanoim --help
python -m nanoim.data.generate --suite mini --out data/mini.jsonl
python -m nanoim.train --config configs/tiny.yaml
python -m nanoim.eval --checkpoint runs/tiny/best.pt --suite mini --out reports/scorecard.json
python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline.html
pytest

Start Here For Reviewers

If you have ten minutes and want to attack the claim, start here:

  1. docs/reviewer_guide.md - how to attack the claim using the committed artifacts.
  2. CLAIM_LEDGER.md - public claims tied to exact evidence, reviewer attacks, and non-claims.
  3. SOURCE_MAP.md - where every artifact, command, model, report, and test lives.
  4. reports/failure_report.md - preserved failure analysis and interpretation.
  5. reports/timeline_gallery.html - local visual proof gallery for same-transcript/different-timing pairs.

The commands below refresh committed artifact paths. Use --out /tmp/... for no-diff smoke runs.

Ten-Minute Command Path

If you are reviewing the claim skeptically, read these in order:

  1. paper/temporal_aliasing.md for the toy impossibility argument.
  2. tests/test_aliasing.py for the anti-leakage contract.
  3. reports/scorecard.md for the human-readable result summary.
  4. reports/noisy_sweep.json and reports/adversarial_controls.json for timing and shortcut checks.
  5. reports/timeline_gallery.html for side-by-side same-transcript/different-timing examples.
  6. reports/failure_report.md for known failures and interpretation.
  7. HOSTILE_REVIEW.md and AUTHOR_RESPONSE.md for the strongest objections and current answers.

Task Families

nanoIM ships 10 interaction families, each designed so a transcript can be identical while the correct micro-turn action differs:

| Family | Missing variable | Example action contrast |
|---|---|---|
| yield detection | silence duration and user audio state | WAIT vs SPEAK |
| hesitation | filled pauses and non-final prosody proxy | WAIT vs BACKCHANNEL |
| barge-in recovery | overlap and interruption onset | STOP_SPEAKING vs WAIT |
| self-correction | revision timing | WAIT vs SPEAK |
| backchannel timing | short acknowledgement placement | BACKCHANNEL vs ASK_CLARIFICATION |
| clarification timing | visual/text confidence cue | WAIT vs ASK_CLARIFICATION |
| visual cue trigger | non-verbal event stream | ASK_CLARIFICATION vs SPEAK |
| approval interruption | policy boundary during speech | REQUEST_APPROVAL vs CALL_TOOL |
| async tool weaving | background tool result timing | CALL_TOOL vs WAIT |
| background result integration | external result arrives mid-turn | RESUME_WITH_RESULT vs SPEAK |

Example Alias Pair

Both examples flatten to the same transcript:

User: book the 3 pm slot

But they require different actions:

| Trace | Preserved micro-turn state | Correct action |
|---|---|---|
| A | user is still hesitating, no approval cue | WAIT |
| B | user has yielded, approval cue is active | REQUEST_APPROVAL |

The transcript-only model sees identical input. The micro-turn model receives the streams that decide the action.

Architecture

flowchart TB
    subgraph Data["data layer"]
        G["nanoim.data.generate"]
        P["nanoim.data.pilot"]
        HP["nanoim.data.human_protocol"]
    end

    subgraph Features["representation"]
        TF["transcript-only features"]
        MF["micro-turn stream/time features"]
    end

    subgraph Policies["policies"]
        TM["transcript majority / NB / oracle"]
        RH["rule harness"]
        GRU["tiny GRU"]
        TX["tiny Transformer"]
    end

    subgraph Evidence["evidence"]
        EV["nanoim.eval"]
        SW["nanoim.experiments.sweep"]
        CT["nanoim.controls"]
        VZ["nanoim.viz.render"]
        VR["nanoim.verify_release"]
    end

    G --> TF
    G --> MF
    P --> MF
    HP --> MF
    TF --> TM
    MF --> RH
    MF --> GRU
    MF --> TX
    TM --> EV
    RH --> EV
    GRU --> EV
    TX --> EV
    EV --> SW
    EV --> CT
    EV --> VZ
    SW --> VR
    CT --> VR

Models

The default model is deliberately small:

  • stream embeddings for symbolic audio/text/visual/background/policy state;
  • time/delta-time embeddings for 200 ms-style micro-turns;
  • a compact GRU backbone with per-timestep action supervision;
  • an action head for both critical-decision and trajectory metrics;
  • optional tiny Transformer comparator with the same feature boundary.

Everything trains locally from scratch. No paid API keys, frontier APIs, hosted models, or hidden services are used.

Metrics

Scorecards are machine-readable JSON. The main release metrics include:

| Metric | Meaning |
|---|---|
| transcript_upper_bound | best possible transcript-only ceiling under paired aliases |
| temporal_aliasing_accuracy | critical-action accuracy on alias pairs |
| paired_separation_rate | fraction of pairs where the policy separates both members correctly |
| action_accuracy | ordinary critical-action accuracy |
| sequence_action_accuracy | per-timestep trajectory accuracy |
| wait_vs_speak_accuracy | separation of wait/speak timing decisions |
| false_interrupt_rate / missed_interrupt_rate | interruption handling errors |
| barge_in_recovery_time | recovery latency after barge-in |
| clarification_timing_score | timing-sensitive clarification behavior |
| tool_weaving_score | background/tool-event integration |
| turn_based_baseline_delta | micro-turn gain over transcript-only baseline |
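The two headline metrics follow directly from their definitions. A minimal sketch, assuming each scored row carries a pair ID, a prediction, and a target (the row layout and function name are illustrative, not the scorecard schema):

```python
from collections import defaultdict


def alias_metrics(rows):
    """Return (temporal_aliasing_accuracy, paired_separation_rate).

    `rows` is an iterable of (pair_id, predicted_action, target_action).
    """
    pairs = defaultdict(list)
    for pair_id, predicted, target in rows:
        pairs[pair_id].append(predicted == target)
    members = [ok for outcomes in pairs.values() for ok in outcomes]
    accuracy = sum(members) / len(members)
    # A pair counts as separated only when every member is answered correctly.
    separation = sum(all(outcomes) for outcomes in pairs.values()) / len(pairs)
    return accuracy, separation


# A transcript-only policy must give both members the same answer.
transcript_only = [("p1", "WAIT", "WAIT"), ("p1", "WAIT", "SPEAK"),
                   ("p2", "SPEAK", "WAIT"), ("p2", "SPEAK", "SPEAK")]
assert alias_metrics(transcript_only) == (0.5, 0.0)
```

This is why a transcript-only system can reach 0.500 accuracy while its paired separation stays at 0.000.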

Anti-Cheat Contract

nanoIM is only interesting if the transcript baseline is fair. The repo therefore checks:

  • alias-pair members share identical flattened transcripts;
  • transcript features exclude pair IDs, task families, labels, event labels, timestamps, split IDs, and timing annotations;
  • generated hard/noisy artifacts match deterministic generator output;
  • destructive controls collapse performance when timing/stream/label structure is broken;
  • public docs distinguish symbolic proof from real audio/video competence.

The relevant tests and checks are tests/test_aliasing.py, nanoim.verify_release (asserts the scientific contract — transcript oracle bound 0.50, paired separation 0.00, and large destructive-control drops), and nanoim.security_audit.
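One way to express the leakage contract as a test, sketched under assumptions: `transcript_features` is a hypothetical bag-of-words featurizer, and the example keys (`transcript`, `target_action`, `pair_id`, `t_ms`) are placeholders for the real schema checked in tests/test_aliasing.py.

```python
def transcript_features(example):
    """Hypothetical transcript-only featurizer: lowercase bag of words."""
    features = {}
    for word in example["transcript"].lower().split():
        features[word] = features.get(word, 0) + 1
    return features


def check_alias_contract(member_a, member_b):
    """Raise AssertionError if the pair is not a fair alias pair."""
    assert member_a["transcript"] == member_b["transcript"], "transcripts differ"
    assert member_a["target_action"] != member_b["target_action"], "actions must differ"
    # If anything besides the transcript reached the features, the members would differ here.
    assert transcript_features(member_a) == transcript_features(member_b), "leakage"


member_a = {"pair_id": "p1-a", "transcript": "book the 3 pm slot",
            "t_ms": 400, "user_audio_state": "hesitating", "target_action": "WAIT"}
member_b = {"pair_id": "p1-b", "transcript": "book the 3 pm slot",
            "t_ms": 900, "user_audio_state": "silent", "target_action": "REQUEST_APPROVAL"}
check_alias_contract(member_a, member_b)
```

Comparing the feature dicts of both members is a stronger check than scanning for forbidden key names, because it catches leakage through any field.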

Ablations And Controls

Hard-suite multi-seed input ablations:

| Variant | Temporal aliasing accuracy | Paired separation | Main observed hit |
|---|---|---|---|
| Full | 1.000 +/- 0.0000 | 1.000 +/- 0.0000 | baseline |
| Small model | 0.983 +/- 0.0236 | 0.967 +/- 0.0471 | capacity halved; hard suite remains learnable |
| No audio | 0.847 +/- 0.0039 | 0.694 +/- 0.0079 | turn ownership / hesitation |
| No visual | 0.942 +/- 0.0068 | 0.883 +/- 0.0136 | clarification timing |
| No background | 0.894 +/- 0.0039 | 0.789 +/- 0.0079 | tool weaving |
| No policy | 0.939 +/- 0.0079 | 0.878 +/- 0.0157 | approval/tool boundary |
| No timing | 0.950 +/- 0.0000 | 0.900 +/- 0.0000 | hard suite mostly uses state streams |

Noisy-suite multi-seed ablations:

| Variant | Temporal aliasing accuracy | Paired separation | Main observed hit |
|---|---|---|---|
| Full | 1.000 +/- 0.0000 | 1.000 +/- 0.0000 | baseline |
| No timing | 0.650 +/- 0.0000 | 0.300 +/- 0.0000 | timing-only alias pairs |
| No background | 0.935 +/- 0.0078 | 0.871 +/- 0.0156 | tool/result state |
| No policy | 0.950 +/- 0.0000 | 0.900 +/- 0.0000 | approval boundaries |
| No visual | 0.944 +/- 0.0051 | 0.887 +/- 0.0102 | visual cue trigger |

Destructive controls on noisy seed 7:

| Control | TAA | Drop from base |
|---|---|---|
| Base | 1.0000 | n/a |
| Scramble t_ms within each trace | 0.4000 | 0.6000 |
| Mismatch critical non-transcript streams across examples | 0.3188 | 0.6813 |
| Permute held-out target labels | 0.1967 | 0.8033 |
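The first control is easy to picture in code. A minimal sketch, assuming a trace is a list of dict rows with a `t_ms` key (the row layout and function name are assumptions, not the interface of nanoim.controls):

```python
import random


def scramble_t_ms(trace, seed=7):
    """Return a copy of the trace with timestamps shuffled across its rows.

    Every other field stays on its original row, so only the timing signal is destroyed.
    """
    rng = random.Random(seed)
    stamps = [row["t_ms"] for row in trace]
    rng.shuffle(stamps)
    return [{**row, "t_ms": stamp} for row, stamp in zip(trace, stamps)]


trace = [
    {"t_ms": 0, "user_audio_state": "speaking"},
    {"t_ms": 200, "user_audio_state": "speaking"},
    {"t_ms": 400, "user_audio_state": "silent"},
    {"t_ms": 600, "user_audio_state": "silent"},
]
scrambled = scramble_t_ms(trace)
assert sorted(row["t_ms"] for row in scrambled) == [0, 200, 400, 600]
assert [row["user_audio_state"] for row in scrambled] == \
       [row["user_audio_state"] for row in trace]
```

If a model's accuracy survives this control, it was not using timing; a large drop is evidence that it was.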

Visual Proof

The visualizer writes a self-contained HTML gallery:

uv run python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline_gallery.html

It shows paired examples side by side with:

  • the identical flattened transcript;
  • micro-turn rows for user audio, text deltas, model state, visual events, background events, and policy events;
  • target action vs micro-turn prediction;
  • transcript-baseline prediction;
  • highlighted row-level errors;
  • timing annotations such as 0.2s-0.4s (200ms).

Open reports/timeline_gallery.html after running the command.

Checks

These are correctness checks, not a self-administered quality score. Each recomputes from the data and committed artifacts:

| Check | Command | What it asserts |
|---|---|---|
| Scientific contract | nanoim.verify_release | transcript oracle bound stays 0.50, paired separation 0.00, multi-seed/Transformer clear the oracle delta, and the destructive controls cause large drops |
| Calibration | nanoim.calibration | noisy checkpoint accuracy 1.000, ECE 0.0251, Brier 0.0074, NLL 0.0275 |
| Security hygiene | nanoim.security_audit | no secrets, no leaked paths, no oversized tracked files |
| Package artifacts | nanoim.package_audit | wheel/sdist metadata, console script, license, required modules |
| Hugging Face bundle | nanoim.hf_export + nanoim.hf_validate | deterministic bundle, card/split/checksum/replay validation with explicit noisy-checkpoint replay thresholds |
| Reproducibility | nanoim.repro | sha256 of every canonical source and evidence artifact |
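For readers unfamiliar with the calibration numbers, here is a minimal sketch of two of them. The equal-width binning for ECE and the sum-over-classes Brier convention are assumptions; nanoim.calibration may bin or normalize differently.

```python
def expected_calibration_error(confidences, correct, bins=10):
    """Weighted mean gap between confidence and accuracy over equal-width bins."""
    buckets = [[0, 0.0, 0] for _ in range(bins)]  # count, confidence sum, hits
    for confidence, ok in zip(confidences, correct):
        index = min(int(confidence * bins), bins - 1)  # confidence 1.0 lands in the top bin
        buckets[index][0] += 1
        buckets[index][1] += confidence
        buckets[index][2] += int(ok)
    total = len(confidences)
    return sum(count / total * abs(hits / count - conf_sum / count)
               for count, conf_sum, hits in buckets if count)


def brier_score(distributions, targets):
    """Mean squared error between each predicted distribution and its one-hot target."""
    total = 0.0
    for distribution, target in zip(distributions, targets):
        total += sum((p - (1.0 if action == target else 0.0)) ** 2
                     for action, p in distribution.items())
    return total / len(targets)


assert expected_calibration_error([1.0, 1.0], [1, 1]) == 0.0
assert expected_calibration_error([1.0, 1.0], [1, 0]) == 0.5
assert brier_score([{"WAIT": 0.5, "SPEAK": 0.5}], ["WAIT"]) == 0.5
```

A model can be perfectly accurate and still have nonzero ECE if it is underconfident on answers it gets right.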

What is intentionally not here: there is no consented natural human-event corpus, and no independent external review. Those are real-world steps the repo does not pretend to have completed. Hosted CI runs on every released tag and should be confirmed green before public citation.

Human Event Bridge

The release includes a future-facing protocol for collecting consented human event logs:

uv run python -m nanoim.data.human_protocol \
  --events data/human_events.jsonl \
  --examples data/human_examples.jsonl \
  --out reports/human_event_protocol_audit.json

This protocol validates schema, monotonic timing, consent metadata, redaction status, obvious identifier leakage, and, when candidate examples exist, the same-transcript/different-action alias-pair contract. Passing without a corpus is not real-world validation; it only proves the repo is ready to collect and audit a real corpus without changing the claim boundary.
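A rough sketch of what three of those checks could look like on a list of event dicts. The keys `consent_id`, `redacted`, and `t_ms` are hypothetical names chosen for illustration; the actual schema lives in nanoim/data/human_protocol.py.

```python
def audit_events(events):
    """Return a list of problems found; an empty list means the log passes."""
    problems = []
    previous_t = None
    for index, event in enumerate(events):
        if not event.get("consent_id"):
            problems.append(f"event {index}: missing consent metadata")
        if event.get("redacted") is not True:
            problems.append(f"event {index}: not marked as redacted")
        t_ms = event.get("t_ms")
        if not isinstance(t_ms, int):
            problems.append(f"event {index}: t_ms missing or not an integer")
            continue
        if previous_t is not None and t_ms < previous_t:
            problems.append(f"event {index}: t_ms {t_ms} goes backwards from {previous_t}")
        previous_t = t_ms
    return problems


clean_log = [
    {"consent_id": "c-001", "redacted": True, "t_ms": 0},
    {"consent_id": "c-001", "redacted": True, "t_ms": 200},
]
broken_log = [
    {"consent_id": "c-001", "redacted": True, "t_ms": 400},
    {"consent_id": "c-001", "redacted": True, "t_ms": 200},
]
assert audit_events(clean_log) == []
assert audit_events(broken_log) == ["event 1: t_ms 200 goes backwards from 400"]
```

Collecting problems into a list instead of raising on the first one lets a single audit run report everything wrong with a log.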

Hugging Face Release Kit

nanoIM includes local Hub-ready package templates for a dataset repo, a model repo, and a static Space:

uv run python -m nanoim.hf_export --out dist/huggingface --manifest-out reports/huggingface_release_manifest.json
uv run python -m nanoim.hf_validate --bundle dist/huggingface --manifest reports/huggingface_release_manifest.json --out reports/huggingface_offline_validation.json
uv run python -m nanoim.release_bundle --out dist/release/nanoim-0.1.5 --manifest-out reports/release_bundle_manifest.json

dist/huggingface/dataset contains the JSONL data and dataset card. dist/huggingface/model contains checkpoints, configs, scorecards, calibration evidence, and a model card. dist/huggingface/space contains a static proof-gallery front door. The offline validator checks cards, repo IDs, split files, checksums, manifests, alias contracts, model-card coverage for every exported checkpoint, and exported noisy-checkpoint replay before upload. The replay gate is near-perfect rather than exact (TAA >= 0.98, paired separation >= 0.97) and is applied to the committed release checkpoint. Full CI separately retrains models for reproduction coverage, then restores the committed release-packet inputs before building the publication bundle. The export is local-only and does not upload credentials or call the Hugging Face API.

See docs/huggingface_release.md and docs/release_engineering.md for the pre-upload checklist, .hfignore hygiene, SBOM/checksum bundle, and provenance workflow.

Full Reproduction

The full reproduction path is intentionally explicit:

uv run python -m nanoim.data.generate --suite mini --out data/mini.jsonl
uv lock --check
uv build
uv run python -m nanoim.package_audit --out reports/package_audit.json
uv run python -m nanoim.data.generate --suite hard --out data/hard.jsonl
uv run python -m nanoim.data.generate --suite noisy --seed 7 --out data/noisy.jsonl
uv run python -m nanoim.data.pilot --from-events data/pilot_events.jsonl --events data/pilot_events.jsonl --out data/pilot.jsonl
uv run python -m nanoim.train --config configs/tiny.yaml
uv run python -m nanoim.eval --checkpoint runs/tiny/best.pt --suite mini --out reports/scorecard.json
uv run python -m nanoim.eval --baseline transcript --suite mini --out reports/transcript_scorecard.json
uv run python -m nanoim.eval --baseline rule --suite mini --out reports/rule_scorecard.json
uv run python -m nanoim.train --config configs/hard.yaml
uv run python -m nanoim.eval --checkpoint runs/hard/full/seed_7/best.pt --suite hard --data data/hard.jsonl --out reports/hard_scorecard.json
uv run python -m nanoim.eval --baseline transcript --suite hard --data data/hard.jsonl --out reports/hard_transcript_scorecard.json
uv run python -m nanoim.eval --baseline transcript_nb --suite hard --data data/hard.jsonl --out reports/hard_transcript_nb_scorecard.json
uv run python -m nanoim.eval --baseline rule --suite hard --data data/hard.jsonl --out reports/hard_rule_scorecard.json
uv run python -m nanoim.experiments.sweep --suite hard --seeds 3,7,11 --ablations 'full;small_model;no_audio;no_visual;no_background;no_policy;no_timing' --epochs 45 --data-seed 7 --out reports/hard_sweep.json
uv run python -m nanoim.train --config configs/transformer_hard.yaml
uv run python -m nanoim.eval --checkpoint runs/transformer/hard/seed_7/best.pt --suite hard --data data/hard.jsonl --out reports/transformer_hard_scorecard.json
uv run python -m nanoim.controls --data data/hard.jsonl --checkpoint runs/transformer/hard/seed_7/best.pt --suite hard --out reports/transformer_hard_controls.json
uv run python -m nanoim.train --config configs/noisy.yaml
uv run python -m nanoim.eval --checkpoint runs/noisy/full/seed_7/best.pt --suite noisy --data data/noisy.jsonl --out reports/noisy_scorecard.json
uv run python -m nanoim.calibration --checkpoint runs/noisy/full/seed_7/best.pt --suite noisy --data data/noisy.jsonl --out reports/calibration_audit.json
uv run python -m nanoim.eval --baseline transcript --suite noisy --data data/noisy.jsonl --out reports/noisy_transcript_scorecard.json
uv run python -m nanoim.eval --baseline transcript_nb --suite noisy --data data/noisy.jsonl --out reports/noisy_transcript_nb_scorecard.json
uv run python -m nanoim.eval --baseline rule --suite noisy --data data/noisy.jsonl --out reports/noisy_rule_scorecard.json
uv run python -m nanoim.eval --baseline field_lookup --suite hard --data data/hard.jsonl --out reports/hard_field_lookup_scorecard.json
uv run python -m nanoim.eval --baseline field_lookup --suite noisy --data data/noisy.jsonl --out reports/noisy_field_lookup_scorecard.json
uv run python -m nanoim.eval --checkpoint runs/hard/full/seed_7/best.pt --suite pilot --data data/pilot.jsonl --eval-split all --out reports/pilot_scorecard.json
uv run python -m nanoim.eval --baseline rule --suite pilot --data data/pilot.jsonl --eval-split all --out reports/pilot_rule_scorecard.json
uv run python -m nanoim.experiments.sweep --suite noisy --seeds 3,7,11 --ablations 'full;no_timing;no_audio;no_visual;no_background;no_policy' --epochs 120 --data-seed 7 --out reports/noisy_sweep.json
uv run python -m nanoim.controls --data data/noisy.jsonl --checkpoint runs/noisy/full/seed_7/best.pt --suite noisy --out reports/adversarial_controls.json
uv run python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline.html
uv run python -m nanoim.viz.render --run reports/example_trace.jsonl --out reports/timeline_gallery.html
uv run python -m nanoim.verify_release --out reports/release_verification.json
uv run python -m nanoim.data.human_protocol --events data/human_events.jsonl --examples data/human_examples.jsonl --out reports/human_event_protocol_audit.json
uv run python -m nanoim.hf_export --out dist/huggingface --manifest-out reports/huggingface_release_manifest.json
uv run python -m nanoim.hf_validate --bundle dist/huggingface --manifest reports/huggingface_release_manifest.json --out reports/huggingface_offline_validation.json
uv run python -m nanoim.security_audit --out reports/security_audit.json
uv run python -m nanoim.repro --out reports/reproducibility.json
uv run python -m nanoim.release_bundle --out dist/release/nanoim-0.1.5 --manifest-out reports/release_bundle_manifest.json

Repository Map

| Path | Role |
|---|---|
| nanoim/data/generate.py | deterministic symbolic suite generator |
| nanoim/data/pilot.py | raw event-log to micro-turn importer |
| nanoim/data/human_protocol.py | future consented human-event protocol and alias-contract validator |
| nanoim/features.py | transcript and micro-turn feature boundaries |
| nanoim/model.py | tiny GRU and Transformer models |
| nanoim/train.py | from-scratch training loop |
| nanoim/eval.py | scorecard evaluator |
| nanoim/calibration.py | checkpoint confidence-calibration audit |
| nanoim/controls.py | destructive controls |
| nanoim/viz/render.py | timeline HTML renderer |
| nanoim/baselines.py | transcript majority/NB, rule harness, field-lookup table |
| nanoim/verify_release.py | asserts the scientific result contract |
| nanoim/security_audit.py | local secret/path/size hygiene scan |
| nanoim/hf_export.py | Hugging Face dataset/model/Space export |
| nanoim/hf_validate.py | offline Hugging Face bundle validator |
| nanoim/package_audit.py | wheel/sdist artifact auditor |
| nanoim/release_bundle.py | checksummed release bundle and SBOM |
| nanoim/repro.py | sha256 manifest of canonical artifacts |
| SOURCE_MAP.md | reviewer map from claims to files |
| CLAIM_LEDGER.md | public claims, evidence, non-claims |

Scope Boundaries

  • The substrate is symbolic, not real audio/video.
  • The pilot importer uses scripted local events with measured timing; it proves an ingestion path, not natural audio/video competence.
  • The result shows representation value in a controlled lab, not broad conversational intelligence.
  • The hard suite primarily tests state-stream aliasing; the noisy suite adds counterbalanced timing-only cases and is the timing-ablation evidence.
  • The human-event protocol is ready, but no consented natural human-event corpus is included.
  • Public claims should say "controlled symbolic evidence for a representation bottleneck," not "solved interaction modeling."
  • The core release path is offline: no API keys, no hosted-model dependency. The only module that makes outbound HTTP calls is the optional nanoim.experiments.llm_baseline, which talks to a user-supplied OpenAI-compatible endpoint (Ollama at 127.0.0.1:11434 or LM Studio at 127.0.0.1:1234 by default) and is not invoked by any release gate.

Citation

@software{lovell_nanoim,
  title     = {nanoIM: A Tiny Interaction-Model Lab for Temporal Aliasing},
  author    = {Jason Lovell},
  year      = {2026},
  version   = {0.1.6},
  doi       = {10.5281/zenodo.20492362},
  publisher = {Zenodo},
  license   = {MIT},
  url       = {https://github.com/jlov7/nanoIM}
}

See CITATION.cff for machine-readable citation metadata. Zenodo record: https://doi.org/10.5281/zenodo.20492362.

Launch Line

That is the whole artifact: a small, inspectable proof that interaction modeling needs representations that preserve time.

Personal Work Disclaimer

This repository is personal research and engineering work by Jason Lovell. It is not affiliated with, endorsed by, sponsored by, or representative of any current or former employer or organization. All views, code, claims, mistakes, and limitations are the author's own.
