
No chunking for markdown files — entire document sent as single LLM message, fails on context overflow #73

Description

@jr2804

Problem

OpenKB sends the entire source document as one LLM message when compiling markdown files. There is no splitting, truncation, or chunking strategy for .md sources.

In agent/compiler.py, compile_short_doc() injects the full document text into a single prompt:

content = source_path.read_text(encoding="utf-8")
doc_msg = {"role": "user", "content": _cached_text(_SUMMARY_USER.format(
    doc_name=doc_name, content=content,
))}

The long-document path (index_long_document() via PageIndex) only triggers for PDFs with page count ≥ pageindex_threshold (default 20). Markdown files have no such fallback — they always take the short-doc code path regardless of size.

Impact: Unusable with local/private LLMs

The README and defaults suggest models like gpt-5.4-mini, implying cloud-scale models with 1M+ token context windows. But for private knowledge bases — the stated use case — users often need to run local models with context windows of 4K–32K tokens.

With a 32K context model, any markdown file exceeding ~24K tokens (roughly 96K characters / ~30 pages) will:

  • Fail outright with a context-length-exceeded API error, or
  • Produce truncated/garbled output when the LLM silently drops the tail of the prompt
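The arithmetic behind those numbers can be sketched as a pre-flight check. This is only an illustration: the 4-characters-per-token ratio is a rough heuristic for English prose rather than a real tokenizer, and `estimate_tokens`, `fits_context`, and `prompt_fraction` are hypothetical names, not part of OpenKB.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English prose, not a tokenizer


def estimate_tokens(text: str) -> int:
    """Rough token estimate derived from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_context(text: str, context_window: int, prompt_fraction: float = 0.75) -> bool:
    """True if the text fits in the share of the window reserved for the prompt.

    prompt_fraction leaves headroom for the system prompt and the model's reply.
    """
    budget = int(context_window * prompt_fraction)
    return estimate_tokens(text) <= budget


# A 96K-character document is about 24K tokens, right at the budget of a 32K model.
doc = "a" * 96_000
print(estimate_tokens(doc), fits_context(doc, context_window=32_000))
```

A check like this would let the compiler fail early with a clear message instead of surfacing a provider-specific context-length error.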

For reference, our corpus of 3GPP technical specifications, converted to markdown, contains files ranging from a few KB to 14 MB. Three of them (3–14 MB each) could never be ingested because they exceed any practical context window. Even mid-sized documents (~50 pages) are risky with 8K–16K context models.

What a fix could look like

A chunking strategy for markdown (and other text-based formats) similar to what other wiki frameworks implement:

  1. Heading-aware splitting: Split on #/##/### boundaries so chunks respect document structure
  2. Token-aware sizing: Estimate token count per chunk and split when exceeding a configurable threshold (e.g., 75% of model context)
  3. Hierarchical synthesis: Summarize each chunk individually, then synthesize chunk summaries into a final document summary
  4. Graceful degradation: At minimum, truncate with a [...truncated at N tokens...] marker instead of sending an oversized prompt that will fail
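The four steps above could look roughly like the following. This is a minimal sketch under stated assumptions, not OpenKB code: every function name here is hypothetical, token counts use a 4-characters-per-token heuristic, and `summarize` is a caller-supplied callable standing in for whatever LLM call the compiler makes.

```python
import re
from typing import Callable, List

HEADING_RE = re.compile(r"^#{1,3} ", re.MULTILINE)  # lines starting with #, ## or ###


def estimate_tokens(text: str) -> int:
    # Assumption: ~4 characters per token; swap in a real tokenizer if available.
    return (len(text) + 3) // 4


def split_on_headings(markdown: str) -> List[str]:
    """Step 1: split into sections that each start at a #/##/### heading."""
    starts = [m.start() for m in HEADING_RE.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    return [markdown[a:b] for a, b in zip(bounds, bounds[1:]) if markdown[a:b].strip()]


def chunk_markdown(markdown: str, max_tokens: int) -> List[str]:
    """Step 2: pack consecutive sections into chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    chunks: List[str] = []
    current = ""
    for section in split_on_headings(markdown):
        pieces = [section]
        if estimate_tokens(section) > max_tokens:
            # A single oversized section is hard-split by character count.
            step = max_tokens * 4
            pieces = [section[i:i + step] for i in range(0, len(section), step)]
        for piece in pieces:
            if current and estimate_tokens(current + piece) > max_tokens:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks


def summarize_document(markdown: str, summarize: Callable[[str], str], max_tokens: int) -> str:
    """Step 3: summarize each chunk, then synthesize the chunk summaries."""
    chunks = chunk_markdown(markdown, max_tokens)
    if len(chunks) <= 1:
        return summarize(markdown)
    partials = [summarize(chunk) for chunk in chunks]
    return summarize("\n\n".join(partials))


def truncate_with_marker(text: str, max_tokens: int) -> str:
    """Step 4: last-resort truncation with an explicit marker."""
    if estimate_tokens(text) <= max_tokens:
        return text
    return text[: max_tokens * 4] + f"\n[...truncated at {max_tokens} tokens...]"
```

Two limitations of this sketch: the heading regex also matches `# ` lines inside fenced code blocks, and the synthesis is single-level, so for very large documents the joined chunk summaries could themselves exceed the budget and would need another round.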

The existing pageindex_threshold config key could be extended to apply to markdown files (e.g., character or token count threshold), or a new config key could be introduced.
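As an illustration of the second option, routing could be driven by a separate key. In this sketch `markdown_token_threshold`, `CompileConfig`, and `choose_path` are all hypothetical names; only `pageindex_threshold` and its default of 20 come from the behaviour described above, and the token count is again estimated at roughly 4 characters per token.

```python
from dataclasses import dataclass


@dataclass
class CompileConfig:
    pageindex_threshold: int = 20            # existing key: PDF page count
    markdown_token_threshold: int = 24_000   # hypothetical new key (estimated tokens)


def choose_path(suffix: str, cfg: CompileConfig, *, page_count: int = 0, char_count: int = 0) -> str:
    """Return "long" or "short" for a source file.

    PDFs are routed by page count, as today. Text formats are routed by an
    estimated token count (assumption: ~4 characters per token).
    """
    if suffix.lower() == ".pdf":
        return "long" if page_count >= cfg.pageindex_threshold else "short"
    estimated_tokens = (char_count + 3) // 4
    return "long" if estimated_tokens >= cfg.markdown_token_threshold else "short"


cfg = CompileConfig()
print(choose_path(".md", cfg, char_count=14_000_000))  # a 14 MB markdown file
```

Keeping the markdown threshold in its own key avoids overloading `pageindex_threshold` with two different units (pages versus tokens).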

Environment

  • OpenKB version: latest (pip)
  • Model: deepseek-v4-flash via Ollama cloud (128K context, but reasoning tokens consume significant budget)
  • Document corpus: 475 markdown files converted from 3GPP ATIAS specifications
  • 3 documents too large for any practical context window (3–14 MB)
