Description
When the Mattermost Agents plugin calls the Anthropic Claude API, it sends the entire conversation history as the prompt on every request. For active threads, this results in extremely large prompts being sent repeatedly without leveraging Anthropic's Prompt Caching feature.
The Problem
We analyzed our production Claude API usage logs and found the following pattern for a single active user in one day (March 30, 2026):
- 113 POST /api/v4/posts requests triggered Claude API calls
- Each call sent 2 requests to Claude: a small pre-processing request (max_tokens: 25) and the main response request (max_tokens: 64000)
- The main request's prompt_token_count grew from ~324,000 to ~364,000 tokens as the conversation progressed (prompt_length: ~546,000 → ~617,000 chars)
- prompt_token_count_cache_read: 0 and prompt_token_count_cache_create: 0 on every request, i.e. no caching is used at all
- cache_routing_strategy: null, i.e. automatic cache routing is not enabled either
- Estimated cost for this single user: ~$108/day
With prompt caching enabled, cache reads are priced 90% lower than regular input tokens ($0.30/MTok vs $3.00/MTok for Claude Sonnet), which would bring this down to roughly ~$11/day, a 90% cost reduction.
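Back-of-the-envelope check of those figures (an estimate, not taken directly from the logs: it assumes a flat ~320K-token prompt per main request, a 100% cache hit rate, and ignores output tokens and the one-time 1.25x cache-write surcharge):
- Without cache: 113 calls x ~0.32 MTok x $3.00/MTok ≈ $108/day
- With cache: 113 calls x ~0.32 MTok x $0.30/MTok ≈ $11/day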
Root Cause
The plugin uses langchaingo v0.1.14 as its LLM abstraction layer. This version does not support Anthropic Prompt Caching.
There is an open feature branch in langchaingo (feature/anthropic-generic-prompt-caching) that adds a WithCacheTTL() function, but it has not been merged into main yet.
Mattermost Agents Plugin (Go)
├── langchaingo v0.1.14 (no prompt caching support)
└── Anthropic Claude API (called without cache_control)
Proposed Solution
- Upstream: Wait for, or help land, the langchaingo prompt caching branch into main.
- Dependency update: Bump langchaingo to a version that includes WithCacheTTL().
- Plugin changes: Add cache_control: {"type": "ephemeral"} to the messages sent to the Anthropic API. Specifically:
  - System prompt: Add cache_control to the system message so it is cached across requests.
  - Conversation history: Add a cache_control breakpoint at the end of the existing conversation history (just before the latest user message), so that only the new message is processed at the full input price.
Example API request structure with caching:
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 64000,
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "First message..."},
    {"role": "assistant", "content": "Response 1..."},
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Previous message...",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {"role": "assistant", "content": "Previous response..."},
    {"role": "user", "content": "New message (not cached)"}
  ]
}
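To make the intended plugin change concrete, here is a minimal Go sketch that produces the request body above by calling the Anthropic Messages API directly with net/http. It bypasses langchaingo purely for illustration (the real fix would go through the WithCacheTTL() support once it lands), and the struct types are ad hoc for this sketch, not existing plugin or langchaingo code.

package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// cacheControl marks a content block as a prompt-cache breakpoint.
type cacheControl struct {
	Type string `json:"type"` // "ephemeral"
}

// contentBlock is one text block; CacheControl is emitted only when set.
type contentBlock struct {
	Type         string        `json:"type"`
	Text         string        `json:"text"`
	CacheControl *cacheControl `json:"cache_control,omitempty"`
}

// message is a single conversation turn; Content is either a plain string
// or a []contentBlock when a cache breakpoint is needed.
type message struct {
	Role    string `json:"role"`
	Content any    `json:"content"`
}

// request mirrors the Messages API body shown in the example above.
type request struct {
	Model     string         `json:"model"`
	MaxTokens int            `json:"max_tokens"`
	System    []contentBlock `json:"system"`
	Messages  []message      `json:"messages"`
}

func main() {
	ephemeral := &cacheControl{Type: "ephemeral"}

	body := request{
		Model:     "claude-sonnet-4-6",
		MaxTokens: 64000,
		// Breakpoint 1: the system prompt, identical on every request.
		System: []contentBlock{{
			Type:         "text",
			Text:         "You are a helpful assistant...",
			CacheControl: ephemeral,
		}},
		Messages: []message{
			{Role: "user", Content: "First message..."},
			{Role: "assistant", Content: "Response 1..."},
			{
				Role: "user",
				// Breakpoint 2: the last message of the existing history,
				// so the whole prefix becomes a cache read on the next turn.
				Content: []contentBlock{{
					Type:         "text",
					Text:         "Previous message...",
					CacheControl: ephemeral,
				}},
			},
			{Role: "assistant", Content: "Previous response..."},
			// Only the new message is billed at the full input price.
			{Role: "user", Content: "New message (not cached)"},
		},
	}

	payload, _ := json.Marshal(body) // errors elided for brevity
	req, _ := http.NewRequestWithContext(context.Background(), http.MethodPost,
		"https://api.anthropic.com/v1/messages", bytes.NewReader(payload))
	req.Header.Set("content-type", "application/json")
	req.Header.Set("x-api-key", os.Getenv("ANTHROPIC_API_KEY"))
	req.Header.Set("anthropic-version", "2023-06-01")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}

Anthropic documents a limit of four cache_control breakpoints per request, so the two used here (system prompt and end of prior history) leave room for marking tool definitions as well.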
Quick win alternative: As a lower-effort first step, setting cache_routing_strategy: "auto" in the API request body may enable some automatic caching on Anthropic's side without any message restructuring.
Impact
| Metric | Without Cache | With Cache |
| --- | --- | --- |
| Input token cost (Sonnet) | $3.00/MTok | $0.30/MTok (read) |
| Single user (364K tokens x 113 calls/day) | ~$108/day | ~$11/day |
| 100 active users (estimated) | ~$3,000+/day | ~$300/day |