fix(markdown/telegram): stop one-letter Telegram messages under high HTML expansion ratios #39
Open
hah23255 wants to merge 2 commits into op7418:main
Conversation
…r messages
Symptom: Telegram bridge messages arrived fragmented into many tiny
messages, sometimes one character per message ("messages break randomly,
sometimes only one letter").
Root cause in src/lib/bridge/markdown/telegram.ts:splitTelegramChunkByHtmlLimit.
The function computed a `proportionalLimit` of
(currentTextLength * htmlLimit) / renderedHtmlLength. When markdown
rendered to a much larger HTML payload than its source — heavy nested
formatting, many HTML escapes, links wrapped around long text — that
ratio drove the split limit toward zero. The recursive splitter then
produced 1-character MarkdownIR chunks; each 1-char chunk re-rendered
short HTML that fit the limit, was accepted, and became its own
TelegramChunk → its own Telegram message.
The 1-char early-return in splitTelegramChunkByHtmlLimit prevented
infinite recursion but didn't prevent the cascade above that point.
Fix:
- Export MIN_CHUNK_TEXT_LENGTH (256) constant for tests.
- splitTelegramChunkByHtmlLimit: early-return when text <= floor;
candidateLimit and fallback are floored at MIN_CHUNK_TEXT_LENGTH.
- renderTelegramChunksWithinHtmlLimit: accept-as-is condition raised
from `chunk.text.length <= 1` to `<= MIN_CHUNK_TEXT_LENGTH`. Oversized
HTML on a small text chunk falls back to plain-text via the existing
delivery-layer parse-error handler in sendWithRetry.
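A minimal sketch of the collapse and the floor, with illustrative numbers (hypothetical helper names; not the actual telegram.ts code):

```typescript
// Illustrative sketch of the proportional-limit collapse and the
// MIN_CHUNK_TEXT_LENGTH floor described above; helper names and numbers
// are hypothetical, not the actual telegram.ts source.
const MIN_CHUNK_TEXT_LENGTH = 256;

// Before: the candidate split limit shrinks with the HTML expansion ratio.
function candidateLimitOld(textLen: number, htmlLimit: number, htmlLen: number): number {
  return Math.floor((textLen * htmlLimit) / htmlLen);
}

// After: the candidate limit never drops below the floor.
function candidateLimitFloored(textLen: number, htmlLimit: number, htmlLen: number): number {
  return Math.max(MIN_CHUNK_TEXT_LENGTH, candidateLimitOld(textLen, htmlLimit, htmlLen));
}

// A 2000-char chunk that renders to 200,000 chars of HTML (100x expansion)
// against Telegram's 4096-char limit drives the old candidate to 40, well
// below the floor; recursion then compounds this toward 1-char chunks.
// The floored version holds at 256.
console.log(candidateLimitOld(2000, 4096, 200_000));     // 40
console.log(candidateLimitFloored(2000, 4096, 200_000)); // 256
```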
Tests:
- New src/__tests__/unit/markdown-telegram-chunks.test.ts (4 cases):
exports the constant; never produces sub-floor chunks when split
happens; never produces <32-char chunks across 3 pathological inputs
(heavy code-fence + escapes, deeply nested inline + links, pure HTML-
escape soup); normal long-form docs still split correctly.
- 63/63 unit tests pass; tsc --noEmit clean.
Operational:
- Already deployed via local promotion: fork dist/ → skill node_modules/
claude-to-im/dist/ → esbuild rebundle of dist/daemon.mjs. Daemon at
PID 104229 (started after rebuild) is running the fix. Zero chunk
failures since restart.
- Independently fixed: 9 sessions in ~/.claude-to-im/data/sessions.json
had model field "claude-opus-4-7[1m]" (CLI 1M-context tag the bridge
auto-stored from status SSE events but cannot pass back as --model).
Stripped via python+os.replace atomic rewrite. All 12 sessions
preserved, no losses.
Followup (not this commit):
- Same fix should land upstream at op7418/claude-to-im so it survives
skill reinstalls — currently the skill imports from there, not from
this fork.
- The bridge should sanitise model names before storing them on
status events (strip [..] suffix) so this can't recur.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…del sanitiser

Builds on 545e6a5. The earlier MIN_CHUNK_TEXT_LENGTH=256 floor was partial:
1. splitMarkdownIRPreserveWhitespace still emitted sub-MIN remainders
   (e.g., splitting 4096 chars at limit 441 produced 9×441 + 1×127, and
   the 127-char tail surfaced as a tiny Telegram message).
2. splitTelegramChunkByHtmlLimit still recursed on chunks just above MIN
   (splitting 257 at 256 produced 256 + 1).
3. The new regression test silently wasn't running — its filename
   (markdown-telegram-chunks.test.ts) didn't match the bridge-*.test.ts
   glob in package.json:test:unit. Caught + fixed in this commit.
Two-part deeper fix:
- splitMarkdownIRPreserveWhitespace: switch from fixed-stride to
  equal-split (K = ceil(N/limit) chunks of ceil(N/K) each). Every chunk
  size lands within ±1 of N/K, so when limit ≥ MIN, no chunk falls
  below MIN.
- splitTelegramChunkByHtmlLimit: refuse to split when N ≤ 2×MIN. Such
  chunks unavoidably leave a sub-MIN tail; the outer loop accepts them
  as-is and the delivery layer's HTML→plain fallback handles oversized
  HTML.
- Outer renderTelegramChunksWithinHtmlLimit: accept-as-is threshold
  raised from `text ≤ MIN` to `text ≤ 2×MIN` to align with the new
  splitter contract.
Plus a persist-time model sanitiser (prevents the "[1m]" data corruption
from recurring):
- New exported sanitizeModelName() in conversation-engine.ts strips
  trailing bracketed metadata (e.g., the `[1m]` 1M-context tier suffix
  the Claude Code CLI emits on its `status` SSE event). Without this,
  the bridge stored "claude-opus-4-7[1m]" verbatim and then passed it
  back as `--model` on the next turn, where the CLI rejected it and the
  LLM provider fell back to the default. The sanitiser runs on every
  status event before updateSessionModel.
Tests:
- bridge-markdown-telegram-chunks.test.ts (4 cases): exports a sensible
  MIN value; never produces sub-MIN chunks when a split happens; never
  produces <32-char chunks across 3 pathological inputs (heavy code-fence
  + escapes, deeply nested inline + links, pure HTML-escape soup); normal
  long-form docs still split correctly. Renamed from
  markdown-telegram-chunks.test.ts so it matches the bridge-*.test.ts
  glob and actually runs.
- bridge-conversation-engine-sanitize.test.ts (6 cases): strip [1m] for
  Claude opus/sonnet, strip arbitrary [...] suffixes for any provider,
  leave clean names untouched, only strip trailing brackets, trim
  whitespace, defensive null/undefined handling.
- 73/73 unit tests pass; tsc --noEmit clean.
Operational:
- Already deployed via local promotion (fork dist → skill node_modules →
  esbuild rebundle of dist/daemon.mjs). The skill bundle has both new
  symbols (MIN_CHUNK_TEXT_LENGTH ×6, sanitizeModelName ×2 in the bundled
  output). The daemon at PID 148375 (started 06:51 BST after rebuild) is
  running the full fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR addresses a Telegram bridge chunking failure mode where extreme Markdown→HTML expansion caused recursive splitting down to tiny (sometimes 1-character) messages, and adds a small fix to prevent persisting vendor metadata in streamed model identifiers.
Changes:
- Introduce MIN_CHUNK_TEXT_LENGTH and update Telegram render-first chunking/splitting logic to avoid tiny text chunks under high HTML expansion ratios.
- Add regression coverage for pathological Markdown inputs and ensure the new test file matches the unit-test glob.
- Add sanitizeModelName() and apply it to streamed status events before persisting the session model.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| src/lib/bridge/markdown/telegram.ts | Adds a minimum chunk-text floor and revises split strategy to prevent runaway splitting into tiny Telegram messages. |
| src/lib/bridge/conversation-engine.ts | Adds and uses sanitizeModelName() when persisting model names from SSE status events. |
| src/__tests__/unit/bridge-markdown-telegram-chunks.test.ts | Adds regression tests for the Telegram chunker's "tiny messages" failure mode. |
| src/__tests__/unit/bridge-conversation-engine-sanitize.test.ts | Adds unit tests for sanitizeModelName() behavior across providers and edge cases. |
Comment on lines +344 to 350:

```typescript
// Accept the chunk as-is if it fits OR if it's at-or-below 2×MIN
// (the splitter refuses to split below this threshold to avoid sub-MIN
// remainders that surfaced as one-letter Telegram messages). Oversized
// HTML on a small-text chunk is handled by the delivery layer's HTML→
// plain fallback when Telegram rejects it.
if (html.length <= normalizedLimit || chunk.text.length <= MIN_CHUNK_TEXT_LENGTH * 2) {
  rendered.push({ html, text: chunk.text });
```
```diff
  const normalizedLimit = Math.max(1, Math.floor(limit));
- if (normalizedLimit <= 0 || ir.text.length <= normalizedLimit) {
+ const N = ir.text.length;
+ if (normalizedLimit <= 0 || N <= normalizedLimit) {
```
Comment on lines +36 to +37:

```typescript
export function sanitizeModelName(model: string): string {
  if (typeof model !== 'string') return model;
```
Comment on lines 343 to 345:

```diff
  if (statusData.model) {
-   store.updateSessionModel(sessionId, statusData.model);
+   store.updateSessionModel(sessionId, sanitizeModelName(statusData.model));
  }
```
Comment on lines +260 to +266:

```typescript
// Equal-split: divide into K = ceil(N / limit) chunks of ceil(N / K) chars
// each. Avoids the prior fixed-stride approach which left a small remainder
// (e.g., splitting 4096 at 441 → 9×441 + 1×127); the 127-char tail then
// bypassed the MIN_CHUNK_TEXT_LENGTH floor and surfaced as a tiny Telegram
// message. Equal-split keeps every chunk size within ±1 of N/K, so when
// the caller's `limit` itself is ≥ MIN we never produce sub-MIN remainders.
const K = Math.ceil(N / normalizedLimit);
```
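The equal-split arithmetic commented above can be illustrated with a small runnable sketch. This is illustrative only: the real splitMarkdownIRPreserveWhitespace also preserves whitespace boundaries, which this omits, and it balances chunk sizes so every chunk stays within ±1 of N/K (the guarantee the comment states).

```typescript
// Illustrative equal-split size calculation (not the actual telegram.ts
// code). Balanced so every chunk size is floor(N/K) or ceil(N/K), i.e.
// within ±1 of N/K; when the caller's limit >= MIN, no chunk can fall
// below MIN.
function equalSplitSizes(N: number, limit: number): number[] {
  if (N <= limit) return [N];
  const K = Math.ceil(N / limit);   // number of chunks
  const base = Math.floor(N / K);   // minimum chunk size
  const extra = N % K;              // first `extra` chunks get one more char
  return Array.from({ length: K }, (_, i) => base + (i < extra ? 1 : 0));
}

// Fixed-stride splitting of 4096 at limit 441 left a 127-char tail
// (9×441 + 127); equal-split instead yields six 410s and four 409s,
// all comfortably above the 256-char floor.
console.log(equalSplitSizes(4096, 441));
```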
Problem (reproducible)
When markdownToTelegramChunks() is called with markdown that renders to disproportionately large HTML — heavy nested inline formatting, many HTML escapes, links wrapped over long text — splitTelegramChunkByHtmlLimit recursively splits down to single-character chunks, each becoming its own Telegram message. From the user's side this looks like "messages break randomly, sometimes only one letter".

The included regression test (bridge-markdown-telegram-chunks.test.ts) reproduces the symptom on three pathological-but-plausible inputs.

Root cause

splitTelegramChunkByHtmlLimit computed proportionalLimit = (textLength × htmlLimit) / renderedHtmlLength. When the HTML expansion ratio is high, proportionalLimit collapses toward zero. Two paths led to single-letter output:
- splitMarkdownIRPreserveWhitespace used fixed-stride slicing — N=4096 with limit=441 produced 9×441 + 1×127, and the 127-char tail bypassed the previous 1-char early-return floor.
- 1-character chunks then slipped through the chunk.text.length <= 1 accept-as-is branch, each becoming its own message.

Fix (two commits, both small)
- 545e6a5 — introduces an exported MIN_CHUNK_TEXT_LENGTH constant (256) and uses it as a basic floor.
- bff8556 — closes the two paths above:
  - splitMarkdownIRPreserveWhitespace: switch from fixed-stride to equal-split (K = ceil(N/limit) chunks of ceil(N/K) each). Every chunk is within ±1 of N/K, so when limit ≥ MIN, no chunk falls below MIN.
  - splitTelegramChunkByHtmlLimit: refuse to split when N ≤ 2×MIN (such chunks unavoidably leave a sub-MIN tail). The outer loop's accept-as-is threshold is raised to match.

Tests
- bridge-markdown-telegram-chunks.test.ts (4 cases): exports a sensible MIN; never produces sub-MIN chunks when a split happens; never produces <32-char chunks across 3 pathological inputs (heavy code-fence + escapes, deeply nested inline + links, pure HTML-escape soup); normal long-form docs still split correctly.
- 73/73 unit tests pass; tsc --noEmit clean.

Side note (filename glob)
The original new test was named markdown-telegram-chunks.test.ts — that filename silently doesn't match the bridge-*.test.ts glob in package.json:test:unit. Renamed to bridge-markdown-telegram-chunks.test.ts so it actually runs. The glob may want broadening to **/*.test.ts in a follow-up — flagged separately.

Bonus (separate concern, also in this PR)
Adds sanitizeModelName() in conversation-engine.ts (also exported, also tested). Strips trailing bracketed metadata from model names — the Claude Code CLI emits claude-opus-4-7[1m] on status SSE events to indicate the 1M-context tier; the [1m] was being stored verbatim and then passed back as --model next turn, which the CLI rejects. 6 tests cover Claude tiers, arbitrary providers, defensive null handling, and a trim of trailing whitespace.

If you'd prefer to keep this PR scoped strictly to the Telegram chunker, I'm happy to split the sanitiser into its own PR — let me know.
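A minimal sketch of the sanitiser's contract, matching the behaviour the tests describe (the actual conversation-engine.ts implementation may differ):

```typescript
// Hypothetical implementation of the contract described above: strip one
// trailing bracketed suffix, trim whitespace, and pass non-strings through
// defensively. Only a *trailing* bracket is removed, so brackets in the
// middle of a name are left alone.
function sanitizeModelName(model: string): string {
  if (typeof model !== 'string') return model;
  return model.replace(/\s*\[[^\]]*\]\s*$/, '').trim();
}

console.log(sanitizeModelName('claude-opus-4-7[1m]')); // "claude-opus-4-7"
console.log(sanitizeModelName('claude-sonnet-4-5'));   // unchanged
```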
🤖 Generated with Claude Code