
💡 idea: Support Anthropic Prompt Caching (cache_control) to reduce token costs by ~90% #582

@ynott

Description


When the Mattermost Agents plugin calls the Anthropic Claude API, it sends the entire conversation history as the prompt on every request. For active threads, this results in extremely large prompts being sent repeatedly without leveraging Anthropic's Prompt Caching feature.

The Problem

We analyzed our production Claude API usage logs and found the following pattern for a single active user in one day (March 30, 2026):

  • 113 POST /api/v4/posts requests triggered Claude API calls
  • Each call sent 2 requests to Claude: a small pre-processing request (max_tokens: 25) and the main response request (max_tokens: 64000)
  • The main request's prompt_token_count grew from ~324,000 to ~364,000 tokens as the conversation progressed (prompt_length: ~546,000 β†’ ~617,000 chars)
  • prompt_token_count_cache_read: 0 and prompt_token_count_cache_create: 0 on every request β€” no caching is used at all
  • cache_routing_strategy: null β€” automatic cache routing is also not enabled
  • Estimated cost for this single user: ~$108/day

With prompt caching enabled, cache reads are billed at $0.30/MTok instead of $3.00/MTok for regular Claude Sonnet input (90% cheaper), which would bring this down to roughly $11/day, an approximately 90% cost reduction.

Root Cause

The plugin uses langchaingo v0.1.14 as its LLM abstraction layer. This version does not support Anthropic Prompt Caching.

There is an open feature branch in langchaingo (feature/anthropic-generic-prompt-caching) that adds a WithCacheTTL() function, but it has not been merged into main yet.

Mattermost Agents Plugin (Go)
  └── langchaingo v0.1.14 (no prompt caching support)
        └── Anthropic Claude API (called without cache_control)
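
For reference, today the request flows through langchaingo's generic llms interface, which in v0.1.14 exposes no option for attaching cache_control to messages. The following is a simplified Go sketch of that call shape, not the plugin's actual code; the model name is copied from the example request further down, and the API key is assumed to come from the client's normal configuration (environment variable or anthropic.WithToken):

package main

import (
	"context"
	"log"

	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/llms/anthropic"
)

func main() {
	// langchaingo v0.1.14: the Anthropic client is configured once, with no
	// option related to prompt caching.
	llm, err := anthropic.New(
		anthropic.WithModel("claude-sonnet-4-6"), // model name taken from the example request below
	)
	if err != nil {
		log.Fatal(err)
	}

	// The plugin rebuilds and resends the entire conversation on every call.
	messages := []llms.MessageContent{
		llms.TextParts(llms.ChatMessageTypeSystem, "You are a helpful assistant..."),
		llms.TextParts(llms.ChatMessageTypeHuman, "First message..."),
		llms.TextParts(llms.ChatMessageTypeAI, "Response 1..."),
		llms.TextParts(llms.ChatMessageTypeHuman, "New message"),
	}

	// No CallOption here attaches cache_control to any of these parts, so
	// every request is billed entirely as uncached input tokens.
	if _, err := llm.GenerateContent(context.Background(), messages, llms.WithMaxTokens(64000)); err != nil {
		log.Fatal(err)
	}
}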

Proposed Solution

  1. Upstream: Wait for or help land the langchaingo prompt caching branch into main.
  2. Dependency update: Bump langchaingo to a version that includes WithCacheTTL().
  3. Plugin changes: Add cache_control: {"type": "ephemeral"} to the messages sent to the Anthropic API. Specifically:
    • System prompt: Add cache_control to the system message so it is cached across requests.
    • Conversation history: Add a cache_control breakpoint at the end of the existing conversation history (just before the latest user message), so that only the new message is processed at full price.

Example API request structure with caching:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 64000,
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "First message..."},
    {"role": "assistant", "content": "Response 1..."},
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Previous message...",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {"role": "assistant", "content": "Previous response..."},
    {"role": "user", "content": "New message (not cached)"}
  ]
}
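
Once a langchaingo release includes that branch, the plugin-side change (steps 2 and 3 above) could reduce to passing a cache option when the Anthropic client is built. The sketch below is hypothetical: it assumes WithCacheTTL() lands as an anthropic constructor option taking a TTL string, and that langchaingo then places the cache_control breakpoints on the system prompt and prior history itself; the merged API may well differ (a per-call or per-message option is equally plausible):

package agents // illustrative package name, not the plugin's real layout

import (
	"context"

	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/llms/anthropic"
)

// respondWithCaching is a hypothetical sketch of the proposed plugin change.
// It assumes the feature/anthropic-generic-prompt-caching branch ships
// WithCacheTTL as a constructor option and that langchaingo attaches the
// cache_control breakpoints (system prompt + prior history) on its own.
func respondWithCaching(ctx context.Context, messages []llms.MessageContent) (*llms.ContentResponse, error) {
	llm, err := anthropic.New(
		anthropic.WithModel("claude-sonnet-4-6"),
		anthropic.WithCacheTTL("5m"), // hypothetical option; ephemeral cache entries live around 5 minutes
	)
	if err != nil {
		return nil, err
	}
	// With the breakpoints in place, only the newest user message is billed
	// as fresh input; the cached prefix is read back at the reduced rate.
	return llm.GenerateContent(ctx, messages, llms.WithMaxTokens(64000))
}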

Quick win alternative: As a lower-effort first step, setting cache_routing_strategy: "auto" in the API request body may enable some automatic caching on Anthropic's side without any message restructuring.

Impact

Metric                                        Without Cache    With Cache
Input token cost (Sonnet)                     $3.00/MTok       $0.30/MTok (read)
Single user (364K tokens x 113 calls/day)     ~$108/day        ~$11/day
100 active users (estimated)                  ~$3,000+/day     ~$300/day

