
💡 idea: Support Anthropic Prompt Caching (cache_control) to reduce token costs by ~90% #582

@ynott

Description


When the Mattermost Agents plugin calls the Anthropic Claude API, it sends the entire conversation history as the prompt on every request. For active threads, this results in extremely large prompts being sent repeatedly without leveraging Anthropic's Prompt Caching feature.

The Problem

We analyzed our production Claude API usage logs and found the following pattern for a single active user in one day (March 30, 2026):

  • 113 POST /api/v4/posts requests triggered Claude API calls
  • Each call sent 2 requests to Claude: a small pre-processing request (max_tokens: 25) and the main response request (max_tokens: 64000)
  • The main request's prompt_token_count grew from ~324,000 to ~364,000 tokens as the conversation progressed (prompt_length: ~546,000 β†’ ~617,000 chars)
  • prompt_token_count_cache_read: 0 and prompt_token_count_cache_create: 0 on every request β€” no caching is used at all
  • cache_routing_strategy: null β€” automatic cache routing is also not enabled
  • Estimated cost for this single user: ~$108/day

With prompt caching enabled, cache reads are billed at $0.30/MTok instead of $3.00/MTok for regular Claude Sonnet input (90% cheaper), which would bring this down to roughly $11/day, an approximately 90% cost reduction.

Root Cause

The plugin uses langchaingo v0.1.14 as its LLM abstraction layer. This version does not support Anthropic Prompt Caching.

There is an open feature branch in langchaingo (feature/anthropic-generic-prompt-caching) that adds a WithCacheTTL() function, but it has not been merged into main yet.

Mattermost Agents Plugin (Go)
  └── langchaingo v0.1.14 (no prompt caching support)
        └── Anthropic Claude API (called without cache_control)
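
For reference, today the request flows through langchaingo's generic llms interface, which in v0.1.14 exposes no option for attaching cache_control to messages. The following is a simplified Go sketch of that call shape, not the plugin's actual code; the model name is copied from the example request further down, and the API key is assumed to come from the client's normal configuration (environment variable or anthropic.WithToken):

package main

import (
	"context"
	"log"

	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/llms/anthropic"
)

func main() {
	// langchaingo v0.1.14: the Anthropic client is configured once, with no
	// option related to prompt caching.
	llm, err := anthropic.New(
		anthropic.WithModel("claude-sonnet-4-6"), // model name taken from the example request below
	)
	if err != nil {
		log.Fatal(err)
	}

	// The plugin rebuilds and resends the entire conversation on every call.
	messages := []llms.MessageContent{
		llms.TextParts(llms.ChatMessageTypeSystem, "You are a helpful assistant..."),
		llms.TextParts(llms.ChatMessageTypeHuman, "First message..."),
		llms.TextParts(llms.ChatMessageTypeAI, "Response 1..."),
		llms.TextParts(llms.ChatMessageTypeHuman, "New message"),
	}

	// No CallOption here attaches cache_control to any of these parts, so
	// every request is billed entirely as uncached input tokens.
	if _, err := llm.GenerateContent(context.Background(), messages, llms.WithMaxTokens(64000)); err != nil {
		log.Fatal(err)
	}
}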

Proposed Solution

  1. Upstream: Wait for or help land the langchaingo prompt caching branch into main.
  2. Dependency update: Bump langchaingo to a version that includes WithCacheTTL().
  3. Plugin changes: Add cache_control: {"type": "ephemeral"} to the messages sent to the Anthropic API. Specifically:
    • System prompt: Add cache_control to the system message so it is cached across requests.
    • Conversation history: Add a cache_control breakpoint at the end of the existing conversation history (just before the latest user message), so that only the new message is processed at full price.

Example API request structure with caching:

{
  "model": "claude-sonnet-4-6",
  "max_tokens": 64000,
  "system": [
    {
      "type": "text",
      "text": "You are a helpful assistant...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {"role": "user", "content": "First message..."},
    {"role": "assistant", "content": "Response 1..."},
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Previous message...",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {"role": "assistant", "content": "Previous response..."},
    {"role": "user", "content": "New message (not cached)"}
  ]
}
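
Once a langchaingo release includes that branch, the plugin-side change (steps 2 and 3 above) could reduce to passing a cache option when the Anthropic client is built. The sketch below is hypothetical: it assumes WithCacheTTL() lands as an anthropic constructor option taking a TTL string, and that langchaingo then places the cache_control breakpoints on the system prompt and prior history itself; the merged API may well differ (a per-call or per-message option is equally plausible):

package agents // illustrative package name, not the plugin's real layout

import (
	"context"

	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/llms/anthropic"
)

// respondWithCaching is a hypothetical sketch of the proposed plugin change.
// It assumes the feature/anthropic-generic-prompt-caching branch ships
// WithCacheTTL as a constructor option and that langchaingo attaches the
// cache_control breakpoints (system prompt + prior history) on its own.
func respondWithCaching(ctx context.Context, messages []llms.MessageContent) (*llms.ContentResponse, error) {
	llm, err := anthropic.New(
		anthropic.WithModel("claude-sonnet-4-6"),
		anthropic.WithCacheTTL("5m"), // hypothetical option; ephemeral cache entries live around 5 minutes
	)
	if err != nil {
		return nil, err
	}
	// With the breakpoints in place, only the newest user message is billed
	// as fresh input; the cached prefix is read back at the reduced rate.
	return llm.GenerateContent(ctx, messages, llms.WithMaxTokens(64000))
}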

Quick win alternative: As a lower-effort first step, setting cache_routing_strategy: "auto" in the API request body may enable some automatic caching on Anthropic's side without any message restructuring.

Impact

Metric                                        Without Cache    With Cache
Input token cost (Sonnet)                     $3.00/MTok       $0.30/MTok (read)
Single user (364K tokens x 113 calls/day)     ~$108/day        ~$11/day
100 active users (estimated)                  ~$3,000+/day     ~$300/day

