
Context window management and compaction #96

@weeco

Description


Problem Statement

The SDK has no awareness of context window limits during agent execution. When the conversation history grows beyond the model's context window, the agent fails with FinishReasonLength. There is no proactive management, warning, or mitigation.

Specific issues:

  1. No token counting before sending: The agent builds a request and sends it to the provider without checking whether the message history fits within ModelConstraints.MaxInputTokens. The constraint data exists but is never used proactively.

  2. No graceful degradation: When context is 99% full, the agent happily accepts a new task, sends the request, and fails. There's no early warning or preemptive action.

  3. No compaction strategy: When messages accumulate over a long conversation, there's no way to trim, summarize, or compress the history to stay within limits. The session just grows until it hits the wall.

  4. No per-agent control: Different agents in a multi-agent system may need different context management strategies (e.g., a research agent can afford to lose old context, but an audit agent needs full history).

Proposed Solution

Context window management could be implemented as a combination of a built-in TurnInterceptor plugin and agent-level configuration. This keeps the core agent loop simple while providing opt-in management.

Token estimation

Add a token estimation utility that can approximate token count for a message list:

```go
// TokenEstimator estimates token counts for messages.
type TokenEstimator interface {
    EstimateTokens(messages []llm.Message) int
}

// DefaultEstimator uses a rough 4-chars-per-token heuristic.
// Provider-specific estimators can use tiktoken, etc.
```
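
A minimal sketch of what `DefaultEstimator` could look like under that heuristic. The `Message` struct here is a local stand-in for `llm.Message`, and `PerMessageOverhead` is an illustrative knob for role/formatting overhead, not part of any existing API:

```go
// Message is a minimal stand-in for llm.Message.
type Message struct {
	Role    string
	Content string
}

// DefaultEstimator approximates tokens with the common ~4-characters-per-token
// heuristic, padding each message to cover role markers and separators.
type DefaultEstimator struct {
	PerMessageOverhead int // illustrative; e.g. 4 tokens per message
}

// EstimateTokens sums len(content)/4 plus the per-message overhead.
func (e DefaultEstimator) EstimateTokens(messages []Message) int {
	total := 0
	for _, m := range messages {
		total += len(m.Content)/4 + e.PerMessageOverhead
	}
	return total
}
```

The heuristic deliberately errs simple: it avoids a tokenizer dependency, and because the interceptor acts at a ratio threshold (e.g. 85%), a rough estimate is good enough to trigger compaction before the hard limit.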

Context management interceptor

A TurnInterceptor that checks context usage before each turn and applies a configured strategy:

```go
ctxManager := contextmgmt.New(
    contextmgmt.WithMaxTokenRatio(0.85),                     // Act when context is 85% full
    contextmgmt.WithStrategy(contextmgmt.SlidingWindow(20)), // Keep last 20 messages
    // OR: contextmgmt.WithStrategy(contextmgmt.Summarize(summaryModel))
    // OR: contextmgmt.WithStrategy(contextmgmt.DropOldest(true /* keep system prompt */))
    // OR: contextmgmt.WithStrategy(customStrategyFunc)
)

agent, _ := llmagent.New("assistant", prompt, model,
    llmagent.WithInterceptors(ctxManager),
)
```
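
The interceptor's per-turn decision could look roughly like this. Every type, field, and method name below (`ContextManager`, `BeforeTurn`, `Strategy`, etc.) is an illustrative stand-in, not the SDK's actual API:

```go
// Message is a minimal stand-in for llm.Message.
type Message struct{ Role, Content string }

// Strategy trims a message list to fit within a token budget.
type Strategy func(messages []Message, budget int) []Message

// ContextManager sketches the interceptor's configuration.
type ContextManager struct {
	EstimateTokens func([]Message) int // e.g. the 4-chars-per-token heuristic
	MaxInputTokens int                 // from ModelConstraints.MaxInputTokens
	MaxTokenRatio  float64             // e.g. 0.85
	Apply          Strategy            // sliding window, summarize, etc.
}

// BeforeTurn runs before each turn: if estimated usage crosses the ratio
// threshold, it applies the configured strategy; otherwise it passes the
// history through untouched.
func (c *ContextManager) BeforeTurn(messages []Message) []Message {
	threshold := int(float64(c.MaxInputTokens) * c.MaxTokenRatio)
	if c.EstimateTokens(messages) <= threshold {
		return messages
	}
	return c.Apply(messages, threshold)
}
```

Keeping the check in the interceptor (rather than the agent loop) is what makes this opt-in: agents without the interceptor behave exactly as today.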

Strategies

  • Sliding window: Keep the N most recent messages (simplest, no LLM call needed)
  • Drop oldest: Remove oldest messages while keeping system prompt and recent context
  • Summarize: Use an LLM to summarize older messages into a condensed form, then replace them (most expensive but preserves information)
  • Custom: User-provided function that receives messages and token budget, returns trimmed messages
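
The simplest of these, the sliding window, might be sketched as follows. `Message` is a stand-in type and the function signature is illustrative; the real strategy signature would come from the SDK:

```go
// Message is a minimal stand-in for llm.Message.
type Message struct{ Role, Content string }

// SlidingWindow returns a strategy that keeps the system prompt (when the
// history starts with one) plus the n most recent messages. The early return
// guarantees the system message is never in the kept tail, so it is never
// duplicated.
func SlidingWindow(n int) func(messages []Message) []Message {
	return func(messages []Message) []Message {
		if len(messages) <= n {
			return messages
		}
		out := []Message{}
		if messages[0].Role == "system" {
			out = append(out, messages[0])
		}
		return append(out, messages[len(messages)-n:]...)
	}
}
```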

Proactive warning

Emit a StatusEvent when context usage exceeds a configurable threshold, giving consumers visibility:

```go
StatusEvent{
    Stage:   StatusStageTurnStarted,
    Details: "context window at 87% capacity (35,200/40,000 tokens), compaction applied",
}
```

Use Case Example

Long-running support conversation:

```go
// After 50+ exchanges, the conversation has grown to 35k tokens on a 40k model
// Without management: next turn fails with FinishReasonLength
// With management:

agent, _ := llmagent.New("support", prompt, model,
    llmagent.WithInterceptors(
        contextmgmt.New(
            contextmgmt.WithMaxTokenRatio(0.80),
            contextmgmt.WithStrategy(contextmgmt.SlidingWindow(30)),
        ),
    ),
)

// At turn 51, the interceptor detects context is at 87%
// It trims to the last 30 messages (keeping system prompt)
// Agent continues working normally
// StatusEvent emitted for observability
```

Summarization for high-value conversations:

```go
contextmgmt.WithStrategy(contextmgmt.Summarize(cheapModel))
// When context exceeds threshold:
// 1. Takes messages 0..N-10 (old messages)
// 2. Sends them to cheapModel with "summarize this conversation"
// 3. Replaces old messages with a single summary message
// 4. Keeps last 10 messages intact for recency
```
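
That four-step flow could be sketched as below. `Summarizer` stands in for the actual call to `cheapModel`, and all names are illustrative:

```go
// Message is a minimal stand-in for llm.Message.
type Message struct{ Role, Content string }

// Summarizer stands in for the LLM call; the real version would send the old
// messages to cheapModel with a "summarize this conversation" prompt.
type Summarizer func(old []Message) string

// Summarize keeps the system prompt and the last keepRecent messages, and
// collapses everything in between into a single summary message.
func Summarize(llm Summarizer, keepRecent int) func([]Message) []Message {
	return func(messages []Message) []Message {
		head := 0
		if len(messages) > 0 && messages[0].Role == "system" {
			head = 1
		}
		if len(messages)-head <= keepRecent {
			return messages // nothing old enough to summarize
		}
		old := messages[head : len(messages)-keepRecent]
		out := append([]Message{}, messages[:head]...)
		out = append(out, Message{
			Role:    "assistant",
			Content: "Summary of earlier conversation: " + llm(old),
		})
		return append(out, messages[len(messages)-keepRecent:]...)
	}
}
```

One design note: the summary replaces the old messages in place of the history, so the reduced token count persists across all subsequent turns, which is where the cost savings come from.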

Why This Matters

  • Production reliability: Long-running agent sessions (customer support, coding assistants, research tasks) will inevitably hit context limits. Failing with an error is the worst possible UX.
  • Cost efficiency: Summarization reduces token usage for subsequent turns, directly reducing API costs for long conversations.
  • Already have the data: ModelConstraints.MaxInputTokens is already populated per model. The infrastructure for proactive checking exists — it just needs to be wired up.
  • Plugin-friendly: Implementing this as a TurnInterceptor means it's opt-in, composable with other interceptors, and doesn't add complexity to the core agent loop.
