
fix(markdown/telegram): stop one-letter Telegram messages under high HTML expansion ratios#39

Open
hah23255 wants to merge 2 commits into op7418:main from hah23255:fix/telegram-chunk-min-size

Conversation


@hah23255 hah23255 commented May 4, 2026

Problem (reproducible)

When markdownToTelegramChunks() is called with markdown that renders to disproportionately large HTML — heavy nested inline formatting, many HTML escapes, links wrapped over long text — splitTelegramChunkByHtmlLimit recursively splits down to single-character chunks, each becoming its own Telegram message. From the user's side this looks like "messages break randomly, sometimes only one letter".

The included regression test (bridge-markdown-telegram-chunks.test.ts) reproduces the symptom on three pathological-but-plausible inputs.

Root cause

splitTelegramChunkByHtmlLimit computed proportionalLimit = (textLength × htmlLimit) / renderedHtmlLength. When the HTML expansion ratio is high, proportionalLimit collapses toward zero. Two paths to single-letter output:

  1. splitMarkdownIRPreserveWhitespace used fixed-stride slicing — N=4096 with limit=441 produced 9×441 + 1×127, the 127-char tail bypassed the previous 1-char early-return floor.
  2. Recursive splitting on chunks just above the floor — e.g. text=257 with splitLimit=256 → 256+1, the 1-char remainder was accepted by the outer-loop's chunk.text.length <= 1 accept-as-is branch.
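To make the collapse concrete, here is a minimal sketch of the proportional-limit formula as described above (the function name and the example numbers are illustrative, not taken from the source):

```typescript
// Sketch of the problematic formula described in the PR; names are illustrative.
// The split limit shrinks in proportion to how much the HTML expands over the text.
function proportionalLimit(
  textLength: number,
  htmlLimit: number,
  renderedHtmlLength: number,
): number {
  return Math.floor((textLength * htmlLimit) / renderedHtmlLength);
}

// A 300-char chunk whose HTML renders to 60 000 chars (200x expansion)
// gets a split limit of only 20 characters, and recursion shreds it further:
console.log(proportionalLimit(300, 4096, 60_000)); // 20
```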

Fix (two commits, both small)

545e6a5 — introduces an exported MIN_CHUNK_TEXT_LENGTH constant (256) and uses it as a basic floor.

bff8556 — closes the two paths above:

  1. splitMarkdownIRPreserveWhitespace: switch from fixed-stride to equal-split (K = ceil(N/limit) chunks of ceil(N/K) each). Every chunk is within ±1 of N/K, so when limit ≥ MIN, no chunk falls below MIN.
  2. splitTelegramChunkByHtmlLimit: refuse to split when N ≤ 2×MIN (such chunks unavoidably leave a sub-MIN tail). Outer loop's accept-as-is threshold raised to match.
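A minimal sketch of the equal-split arithmetic (names are illustrative; the real splitter operates on MarkdownIR nodes, not raw sizes — this only models the chunk-size math, balancing remainders so sizes stay within ±1 of N/K):

```typescript
// Equal-split: K = ceil(N / limit) chunks; each chunk takes
// ceil(remaining / chunksLeft) chars, which keeps every size within
// ±1 of N / K and never leaves a tiny tail.
function equalSplitSizes(N: number, limit: number): number[] {
  const K = Math.ceil(N / limit);
  const sizes: number[] = [];
  let remaining = N;
  for (let chunksLeft = K; chunksLeft > 0; chunksLeft--) {
    const size = Math.ceil(remaining / chunksLeft);
    sizes.push(size);
    remaining -= size;
  }
  return sizes;
}

// Fixed-stride slicing of 4096 at limit 441 gave 9×441 + 1×127 (a sub-MIN tail);
// equal-split yields ten chunks of 410 or 409 — all comfortably above MIN = 256.
console.log(equalSplitSizes(4096, 441)); // [410, 410, 410, 410, 410, 410, 409, 409, 409, 409]
```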

Tests

  • bridge-markdown-telegram-chunks.test.ts (4 cases): exports a sensible MIN; never produces sub-MIN chunks when split happens; never produces <32-char chunks across 3 pathological inputs (heavy code-fence + escapes, deeply nested inline + links, pure HTML-escape soup); normal long-form docs still split correctly.
  • All existing tests still pass: 73/73 (tsc --noEmit clean).

Side note (filename glob)

The original new test was named markdown-telegram-chunks.test.ts — that filename silently doesn't match the bridge-*.test.ts glob in package.json:test:unit. Renamed to bridge-markdown-telegram-chunks.test.ts so it actually runs. The glob may want broadening to **/*.test.ts in a follow-up — flagged separately.

Bonus (separate concern, also in this PR)

Adds sanitizeModelName() in conversation-engine.ts (also exported, also tested). Strips trailing bracketed metadata from model names — the Claude Code CLI emits claude-opus-4-7[1m] on status SSE events to indicate the 1M-context tier; the [1m] was being stored verbatim and then passed back as --model next turn, which the CLI rejects. 6 tests cover Claude tiers, arbitrary providers, defensive null handling, and a trim of trailing whitespace.
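A sketch of the sanitiser's contract as described (the regex here is an assumption for illustration; the actual implementation in conversation-engine.ts may differ):

```typescript
// Strips one trailing [bracketed] metadata suffix and trims whitespace,
// per the behavior described in the PR. Defensive: non-string inputs
// pass through unchanged.
function sanitizeModelName(model: string): string {
  if (typeof model !== "string") return model;
  return model.replace(/\s*\[[^\]]*\]\s*$/, "").trim();
}

console.log(sanitizeModelName("claude-opus-4-7[1m]")); // "claude-opus-4-7"
console.log(sanitizeModelName("some-provider/model [beta] ")); // "some-provider/model"
```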

If you'd prefer to keep this PR scoped strictly to the Telegram chunker, I'm happy to split the sanitiser into its own PR — let me know.

🤖 Generated with Claude Code

hah23255 and others added 2 commits May 3, 2026 20:38
…r messages

Symptom: Telegram bridge messages arrived fragmented into many tiny
messages, sometimes one character per message ("messages break randomly,
sometimes only one letter").

Root cause in src/lib/bridge/markdown/telegram.ts:
splitTelegramChunkByHtmlLimit. The function computed a `proportionalLimit`
of (currentTextLength * htmlLimit) / renderedHtmlLength. When markdown
rendered to a much larger HTML payload than its source — heavy nested
formatting, many HTML escapes, links wrapped around long text — that
ratio drove the split limit toward zero. The recursive splitter then
produced 1-character MarkdownIR chunks; each 1-char chunk re-rendered
short HTML that fit the limit, was accepted, and became its own
TelegramChunk → its own Telegram message.

The 1-char early-return in splitTelegramChunkByHtmlLimit prevented
infinite recursion but didn't prevent the cascade above that point.

Fix:
- Export MIN_CHUNK_TEXT_LENGTH (256) constant for tests.
- splitTelegramChunkByHtmlLimit: early-return when text <= floor;
  candidateLimit and fallback are floored at MIN_CHUNK_TEXT_LENGTH.
- renderTelegramChunksWithinHtmlLimit: accept-as-is condition raised
  from `chunk.text.length <= 1` to `<= MIN_CHUNK_TEXT_LENGTH`. Oversized
  HTML on a small text chunk falls back to plain-text via the existing
  delivery-layer parse-error handler in sendWithRetry.

Tests:
- New src/__tests__/unit/markdown-telegram-chunks.test.ts (4 cases):
  exports the constant; never produces sub-floor chunks when split
  happens; never produces <32-char chunks across 3 pathological inputs
  (heavy code-fence + escapes, deeply nested inline + links, pure HTML-
  escape soup); normal long-form docs still split correctly.
- 63/63 unit tests pass; tsc --noEmit clean.

Operational:
- Already deployed via local promotion: fork dist/ → skill node_modules/
  claude-to-im/dist/ → esbuild rebundle of dist/daemon.mjs. Daemon at
  PID 104229 (started after rebuild) is running the fix. Zero chunk
  failures since restart.
- Independently fixed: 9 sessions in ~/.claude-to-im/data/sessions.json
  had model field "claude-opus-4-7[1m]" (CLI 1M-context tag the bridge
  auto-stored from status SSE events but cannot pass back as --model).
  Stripped via python+os.replace atomic rewrite. All 12 sessions
  preserved, no losses.

Followup (not this commit):
- Same fix should land upstream at op7418/claude-to-im so it survives
  skill reinstalls — currently the skill imports from there, not from
  this fork.
- The bridge should sanitise model names before storing them on
  status events (strip [..] suffix) so this can't recur.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…del sanitiser

Builds on 545e6a5. The earlier MIN_CHUNK_TEXT_LENGTH=256 floor was partial:

1. splitMarkdownIRPreserveWhitespace still emitted sub-MIN remainders
   (e.g., splitting 4096 chars at limit 441 produced 9×441 + 1×127, and the
   127-char tail surfaced as a tiny Telegram message).
2. splitTelegramChunkByHtmlLimit still recursed on chunks just above MIN
   (splitting 257 at 256 produced 256+1).
3. The new regression test silently wasn't running — its filename
   (markdown-telegram-chunks.test.ts) didn't match the bridge-*.test.ts
   glob in package.json:test:unit. Caught + fixed in this commit.

Two-part deeper fix:

- splitMarkdownIRPreserveWhitespace: switch from fixed-stride to equal-split
  (K = ceil(N/limit) chunks of ceil(N/K) each). Every chunk size lands within
  ±1 of N/K, so when limit ≥ MIN, no chunk falls below MIN.
- splitTelegramChunkByHtmlLimit: refuse to split when N ≤ 2×MIN. Such chunks
  unavoidably leave a sub-MIN tail; outer loop accepts them as-is and the
  delivery layer's HTML→plain fallback handles oversized HTML.
- Outer renderTelegramChunksWithinHtmlLimit: accept-as-is threshold raised
  from `text ≤ MIN` to `text ≤ 2×MIN` to align with the new splitter contract.

Plus persist-time model sanitiser (prevents "[1m]" data corruption recurring):

- New exported sanitizeModelName() in conversation-engine.ts strips trailing
  bracketed metadata (e.g., the `[1m]` 1M-context tier suffix the Claude
  Code CLI emits on its `status` SSE event). Without this, the bridge stored
  "claude-opus-4-7[1m]" verbatim and then passed it back as `--model` on
  the next turn, where the CLI rejected it and the LLM provider fell back
  to the default. Sanitiser runs on every status event before
  updateSessionModel.

Tests:
- bridge-markdown-telegram-chunks.test.ts (4 cases): exports a sensible MIN
  value; never produces sub-MIN chunks when split happens; never produces
  <32-char chunks across 3 pathological inputs (heavy code-fence + escapes,
  deeply nested inline + links, pure HTML-escape soup); normal long-form
  docs still split correctly. Renamed from markdown-telegram-chunks.test.ts
  so it matches the bridge-*.test.ts glob and actually runs.
- bridge-conversation-engine-sanitize.test.ts (6 cases): strip [1m] for
  Claude opus/sonnet, strip arbitrary [...] suffixes for any provider,
  leave clean names untouched, only strip trailing brackets, trim
  whitespace, defensive null/undefined handling.
- 73/73 unit tests pass; tsc --noEmit clean.

Operational:
- Already deployed via local promotion (fork dist → skill node_modules →
  esbuild rebundle of dist/daemon.mjs). Skill bundle has both new symbols
  (MIN_CHUNK_TEXT_LENGTH ×6, sanitizeModelName ×2 in the bundled output).
  Daemon at PID 148375 (started 06:51 BST after rebuild) is running the
  full fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 4, 2026 05:54

Copilot AI left a comment


Pull request overview

This PR addresses a Telegram bridge chunking failure mode where extreme Markdown→HTML expansion caused recursive splitting down to tiny (sometimes 1-character) messages, and adds a small fix to prevent persisting vendor metadata in streamed model identifiers.

Changes:

  • Introduce MIN_CHUNK_TEXT_LENGTH and update Telegram render-first chunking/splitting logic to avoid tiny text chunks under high HTML expansion ratios.
  • Add regression coverage for pathological Markdown inputs and ensure the new test file matches the unit-test glob.
  • Add sanitizeModelName() and apply it to streamed status events before persisting the session model.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • src/lib/bridge/markdown/telegram.ts — Adds a minimum chunk-text floor and revises the split strategy to prevent runaway splitting into tiny Telegram messages.
  • src/lib/bridge/conversation-engine.ts — Adds and uses sanitizeModelName() when persisting model names from SSE status events.
  • src/__tests__/unit/bridge-markdown-telegram-chunks.test.ts — Adds regression tests for the Telegram chunker's "tiny messages" failure mode.
  • src/__tests__/unit/bridge-conversation-engine-sanitize.test.ts — Adds unit tests for sanitizeModelName() behavior across providers and edge cases.


Comment on lines +344 to 350:

```ts
// Accept the chunk as-is if it fits OR if it's at-or-below 2×MIN
// (the splitter refuses to split below this threshold to avoid sub-MIN
// remainders that surfaced as one-letter Telegram messages). Oversized
// HTML on a small-text chunk is handled by the delivery layer's HTML→
// plain fallback when Telegram rejects it.
if (html.length <= normalizedLimit || chunk.text.length <= MIN_CHUNK_TEXT_LENGTH * 2) {
  rendered.push({ html, text: chunk.text });
```

Comment context (old vs. new guard):

```diff
  const normalizedLimit = Math.max(1, Math.floor(limit));
- if (normalizedLimit <= 0 || ir.text.length <= normalizedLimit) {
+ const N = ir.text.length;
+ if (normalizedLimit <= 0 || N <= normalizedLimit) {
```

Comment on lines +36 to +37:

```ts
export function sanitizeModelName(model: string): string {
  if (typeof model !== 'string') return model;
```

Comment on lines 343 to 345:

```diff
  if (statusData.model) {
-   store.updateSessionModel(sessionId, statusData.model);
+   store.updateSessionModel(sessionId, sanitizeModelName(statusData.model));
  }
```

Comment on lines +260 to +266:

```ts
// Equal-split: divide into K = ceil(N / limit) chunks of ceil(N / K) chars
// each. Avoids the prior fixed-stride approach which left a small remainder
// (e.g., splitting 4096 at 441 → 9×441 + 1×127); the 127-char tail then
// bypassed the MIN_CHUNK_TEXT_LENGTH floor and surfaced as a tiny Telegram
// message. Equal-split keeps every chunk size within `±1` of N/K, so when
// the caller's `limit` itself is ≥ MIN we never produce sub-MIN remainders.
const K = Math.ceil(N / normalizedLimit);
```
@hah23255 hah23255 closed this May 4, 2026
@hah23255 hah23255 deleted the fix/telegram-chunk-min-size branch May 4, 2026 06:40
@hah23255 hah23255 restored the fix/telegram-chunk-min-size branch May 4, 2026 06:43
@hah23255 hah23255 reopened this May 4, 2026