A conversational AI agent with persistent semantic memory, a two-layer semantic cache, web search, cost tracking, and a browser chat UI — all running on Harper with Claude (via the Anthropic API or Google Cloud Vertex AI).
Live demo: agent-example.stephen-demo-org.harperfabric.com/Chat
- Chat with Claude via a REST endpoint (`POST /Agent`) or the built-in browser chat UI (`GET /Chat`)
- Semantic cache — two-layer cache catches repeated and rephrased questions before they reach Claude, returning answers instantly at zero LLM cost
- Web search — Anthropic's built-in server-side web search (`web_search_20250305`, up to 5 uses per turn); no external API key required
- Persistent memory — every message is embedded and stored in Harper; semantic recall surfaces relevant context from past conversations automatically
- Local embeddings — `bge-small-en-v1.5` runs via `harper-fabric-embeddings` (llama.cpp wrapper), entirely in-process; no embedding API key or billing
- Per-response metadata — every API response includes latency, token counts, cost breakdown, web searches used, and vector context stats
- Global savings tracker — cache hits accumulate a running total of USD saved and hit count in a `Stats` table, displayed live in the chat sidebar
- Auto-generated REST APIs — full CRUD on the `Conversation`, `Message`, and `Stats` tables, generated from the GraphQL schema with zero route code
```
User Query
    │
    ▼
┌──────────────────────────────────────────────────────────┐
│                          Harper                          │
│                                                          │
│  1. Embed user message                                   │
│     ┌─────────────────────┐                              │
│     │   EmbeddingCache    │  ← normalized text → vector  │
│     │  hit: ~1ms lookup   │    miss: SLM generates it,   │
│     │     (skip SLM)      │    then stores for next time │
│     └─────────────────────┘                              │
│     Local SLM: bge-small-en-v1.5 (llama.cpp, in-process) │
│                                                          │
│  2. Store user message + embedding                       │
│  3. HNSW semantic cache check (cosine distance < 0.12)   │
│        │                │                                │
│    Cache HIT        Cache MISS                           │
│        │                │                                │
│   Return $0.00      Call Claude ─────────────────────────┼──► Anthropic API
│   + saved $X            │                                │    + Web Search
│   Embed response (via cache/SLM)                         │◄──────────┘
│   Store in Harper                                        │
│                                                          │
└──────────────────────────────────────────────────────────┘
```
Every request is standalone. Ask once, pay for Claude. Ask again — or rephrase the same question — and Harper serves the cached answer instantly at $0. The embedding cache eliminates the SLM cost on repeated text (~2.3s on Fabric → ~1ms).
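The first cache layer (exact-text embedding reuse) can be sketched roughly like this. It is illustrative only: `normalize`, the in-memory `Map`, and the synchronous `embedFn` are stand-ins; the project persists this cache in a Harper table and embeds asynchronously.

```javascript
// Sketch of the embedding-cache layer (hypothetical helper names).
// The real cache lives in Harper's EmbeddingCache table; a Map stands
// in here, and the real embed call is async (simplified to sync).
const embeddingCache = new Map();

// Normalize so trivial variants ("What is Harper?" vs " what is HARPER? ")
// share one cache key.
function normalize(text) {
  return text.trim().toLowerCase().replace(/\s+/g, ' ');
}

function getEmbedding(text, embedFn) {
  const key = normalize(text);
  if (embeddingCache.has(key)) {
    return embeddingCache.get(key); // hit: ~1ms, SLM skipped
  }
  const vector = embedFn(key);      // miss: run bge-small-en-v1.5
  embeddingCache.set(key, vector);  // store for next time
  return vector;
}
```

Because the key is normalized before lookup, casing and whitespace variations of the same text never pay the SLM cost twice.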
Before calling Claude, the agent searches Harper's HNSW vector index for semantically similar past questions:
```javascript
tables.Message.search({
  conditions: {
    attribute: 'embedding',
    comparator: 'lt',
    value: 0.12, // cosine distance < 0.12 ≡ cosine similarity ≥ 0.88
    target: userEmbedding,
  },
  limit: 10,
})
```

Harper's HNSW index evaluates the distance threshold internally — no full table scan, no in-memory cosine math. When a match is found, the agent looks up the assistant reply that followed it and returns that directly. No Claude call, no tokens, no cost.
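Since cosine distance is just `1 - cosine similarity`, the `0.12` threshold admits only matches whose similarity is at least 0.88. For intuition only (Harper computes this inside the HNSW index, never in application JS):

```javascript
// Illustration of the distance metric behind the threshold above.
// Harper evaluates this inside the index; this is not project code.
function cosineDistance(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Same threshold as the search call: smaller distance = closer meaning.
const isCacheHit = (distance) => distance < 0.12;
```

Identical vectors have distance 0 (always a hit); orthogonal (unrelated) vectors have distance 1 (never a hit).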
Cache hits return `cost.total: 0` and include a `cost.saved` field showing what the call would have cost. The saved amount is added to the global `Stats` record (`totalSaved`, `cacheHits`).
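The arithmetic behind those fields is straightforward. A sketch, assuming Claude Sonnet rates of $3 per million input tokens and $15 per million output tokens (an assumption chosen to match the sample `meta.cost` numbers shown later, not code from the project):

```javascript
// Assumed per-million-token USD rates for Claude Sonnet; verify against
// current Anthropic pricing before relying on these numbers.
const INPUT_PER_MTOK = 3;
const OUTPUT_PER_MTOK = 15;

function computeCost(inputTokens, outputTokens, searchCost = 0) {
  const input = (inputTokens * INPUT_PER_MTOK) / 1e6;
  const output = (outputTokens * OUTPUT_PER_MTOK) / 1e6;
  return { input, output, search: searchCost, total: input + output + searchCost };
}

// On a cache hit the call itself is free; the avoided cost is banked
// in the global Stats record instead.
function recordCacheHit(stats, avoidedCost) {
  return {
    ...stats,
    totalSaved: stats.totalSaved + avoidedCost,
    cacheHits: stats.cacheHits + 1,
  };
}
```

With the sample response's 312 input and 148 output tokens, this yields the `0.003156` total that later reappears as `cost.saved` on the cache hit.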
- Node.js 22+
- Harper CLI: `npm install -g harperdb`
- One of:
  - Anthropic API key (direct API — default)
  - Google Cloud project with Vertex AI enabled (GCP Vertex AI)
No embedding API key needed — embeddings run in-process.
```shell
# Clone the repo
git clone https://github.com/stephengoldberg/agent-example-harper.git
cd agent-example-harper

# Install dependencies
npm install

# Configure environment
cp dot-env.example .env
# Edit .env — see "LLM Provider Setup" below

# Start the dev server
npm run dev
```

This agent supports two LLM backends — the direct Anthropic API and Google Cloud Vertex AI. Set `LLM_PROVIDER` in your `.env` to choose which one to use.
The simplest path. You just need an API key from console.anthropic.com.
```
LLM_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
```

Web search is included automatically via Anthropic's server-side `web_search_20250305` tool — no additional API keys required.
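For reference, enabling the server-side tool amounts to one entry in the Messages API request body. A sketch of the relevant fragment (shape per Anthropic's web search tool documentation; verify field names against the current docs):

```json
{
  "model": "claude-sonnet-4-5-20250929",
  "max_tokens": 1024,
  "tools": [
    { "type": "web_search_20250305", "name": "web_search", "max_uses": 5 }
  ],
  "messages": [{ "role": "user", "content": "What is Harper?" }]
}
```

The `max_uses: 5` value corresponds to the "up to 5 uses per turn" limit this agent applies.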
Run Claude through your own GCP project. Useful for enterprise environments, org-level billing, data residency, and keeping everything inside Google Cloud.
1. Enable the Vertex AI API in your GCP project:
https://console.developers.google.com/apis/api/aiplatform.googleapis.com/overview?project=YOUR_PROJECT_ID
2. Enable a Claude model in the Vertex AI Model Garden — search for "Claude" and enable the model you want.
3. Request quota — new projects start with 0 tokens/min. Go to IAM & Admin → Quotas, filter for your Claude model, and request an increase.
4. Create a service account with the Vertex AI User role, download the JSON key, and place it in the project root.
5. Configure .env:
```
LLM_PROVIDER=vertex
VERTEX_PROJECT_ID=my-gcp-project
VERTEX_REGION=us-east5
GOOGLE_APPLICATION_CREDENTIALS=./your-service-account-key.json
```

Note: Web search is not available on Vertex AI by default (requires an org policy change). The agent automatically disables it when running on Vertex.
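Internally, the switch can be as simple as reading the environment once at startup. An illustrative sketch (the function and return fields are hypothetical; the defaults mirror the environment-variable table in this README):

```javascript
// Hypothetical provider-selection helper; names are illustrative.
// Defaults mirror the environment-variable reference table.
function resolveProvider(env) {
  if ((env.LLM_PROVIDER ?? 'anthropic') === 'vertex') {
    return {
      provider: 'vertex',
      model: env.VERTEX_MODEL ?? 'claude-sonnet-4-6',
      region: env.VERTEX_REGION ?? 'global',
      projectId: env.VERTEX_PROJECT_ID,
      webSearch: false, // server-side web search is disabled on Vertex
    };
  }
  return {
    provider: 'anthropic',
    model: env.CLAUDE_MODEL ?? 'claude-sonnet-4-5-20250929',
    apiKey: env.ANTHROPIC_API_KEY,
    webSearch: true, // web_search_20250305 needs no extra key
  };
}
```

Centralizing the decision like this keeps the agent loop provider-agnostic: the rest of the code only sees a model ID and a `webSearch` flag.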
| Variable | Required | Default | Description |
|---|---|---|---|
| `LLM_PROVIDER` | No | `anthropic` | `anthropic` or `vertex` |
| `ANTHROPIC_API_KEY` | When `anthropic` | — | Anthropic API key |
| `VERTEX_PROJECT_ID` | When `vertex` | — | GCP project ID |
| `VERTEX_REGION` | No | `global` | Vertex AI region (e.g. `us-east5`, `global`) |
| `VERTEX_MODEL` | No | `claude-sonnet-4-6` | Vertex model ID |
| `GOOGLE_APPLICATION_CREDENTIALS` | When `vertex` | — | Path to GCP service account JSON key |
| `CLAUDE_MODEL` | No | `claude-sonnet-4-5-20250929` | Anthropic direct API model ID |
First run: On startup, `bge-small-en-v1.5` (~24 MB) is auto-downloaded into `./models/`. This only happens once.
The server starts at http://localhost:9926. Open http://localhost:9926/Chat in your browser.
Open the chat UI:
http://localhost:9926/Chat
Send a message via API:
```shell
curl -X POST http://localhost:9926/Agent \
  -H "Content-Type: application/json" \
  -d '{"message": "What is Harper?"}'
```

Response:
```json
{
  "conversationId": "abc-123",
  "message": { "role": "assistant", "content": "Harper is..." },
  "meta": {
    "latencyMs": 1842,
    "tokens": { "input": 312, "output": 148, "total": 460 },
    "cost": { "input": 0.000936, "output": 0.00222, "search": 0, "total": 0.003156 },
    "webSearches": 0,
    "vectorContext": { "hit": false, "count": 0, "cached": false }
  }
}
```

Continue a conversation:
```shell
curl -X POST http://localhost:9926/Agent \
  -H "Content-Type: application/json" \
  -d '{
    "conversationId": "abc-123",
    "message": "Tell me more about its vector search"
  }'
```

Ask the same question again (cache hit — free and instant):
```shell
curl -X POST http://localhost:9926/Agent \
  -H "Content-Type: application/json" \
  -d '{"message": "What is Harper?"}'
# meta.cost.total = 0, meta.cost.saved = 0.003156
```

Check global savings:
```shell
curl http://localhost:9926/PublicStats/global
# { "id": "global", "totalSaved": 0.003156, "cacheHits": 1, "updatedAt": "..." }
```

Auto-generated CRUD (from schema, no route code written):
```shell
# List all conversations
curl http://localhost:9926/Conversation

# Get messages for a conversation
curl "http://localhost:9926/Message?conversationId=abc-123"
```

```
├── config.yaml           # Harper app configuration (6 lines)
├── schemas/
│   └── schema.graphql    # Database schema (Conversation, Message, Stats + HNSW index)
├── resources/
│   ├── Agent.js          # POST /Agent (agent loop + semantic cache + web search)
│   │                     # GET /PublicStats/:id (public stats endpoint)
│   └── Chat.js           # GET /Chat (full browser chat UI served as HTML)
├── lib/
│   ├── config.js         # Environment variable helpers
│   └── embeddings.js     # Local llama.cpp embeddings via harper-fabric-embeddings
├── models/               # Auto-downloaded GGUF model (gitignored)
├── .env.example          # Environment template
└── package.json
```
```graphql
type Conversation @table @export {
  id: ID @primaryKey
  title: String
  createdAt: String
  updatedAt: String
}

type Message @table @export {
  id: ID @primaryKey
  conversationId: String @indexed
  role: String
  content: String
  cost: Float
  embedding: [Float] @indexed(type: "HNSW", distance: "cosine")
  createdAt: String
}

type Stats @table @export {
  id: ID @primaryKey
  totalSaved: Float
  cacheHits: Int
  updatedAt: String
}
```

`@table` creates the database table. `@export` generates the full REST CRUD API. `@indexed(type: "HNSW", distance: "cosine")` adds the HNSW vector index used for both semantic cache lookup and context retrieval.
```shell
# 1. Create a cluster at https://fabric.harper.fast/

# 2. Add credentials to .env
CLI_TARGET=https://your-instance.your-org.harperfabric.com:9925/
CLI_TARGET_USERNAME=your-username
CLI_TARGET_PASSWORD=your-password

# 3. Deploy
npm run deploy
```

Rolling restarts and replication are handled automatically.
Public access note: To make endpoints accessible without authentication, set `target.checkPermission = false` inside the handler method. This is the V2 Resource API pattern (`loadAsInstance = false`). The V1 method `allowRead()` is ignored in V2 Resources and has no effect.
| Concern | Traditional Stack | Harper |
|---|---|---|
| Database | Postgres / MongoDB | Built in |
| Vector search | Pinecone / Weaviate | Built in (HNSW — one schema directive) |
| Semantic cache | Redis + custom logic | Built in (native HNSW threshold filter) |
| API server | Express / Fastify | Auto-generated from schema |
| Chat UI server | Vite / Next.js | Resource returning `Response(html)` |
| Embeddings | Voyage / OpenAI API | Local via `harper-fabric-embeddings` (24 MB, in-process) |
| Deployment | Docker + K8s + cloud | `harperdb deploy .` |
Key insights from building this:
- Native HNSW conditions search scales. Passing
comparator: 'lt'to Harper's vector search evaluates the distance threshold inside the index. No JS cosine math, no full scans. - Everything in one process means no network hops. Database, vector index, cache, API, and agent code share the same runtime. No Redis round-trip, no vector DB round-trip.
- The schema is the only config you need. One
@indexed(type: "HNSW", distance: "cosine")directive creates the vector index. One@exportgenerates the CRUD API. One@indexedonconversationIdcreates the secondary index. - Resources can return anything. A
Resourcesubclass can return aResponsewith any content type — JSON, HTML, plain text. The chat UI lives in the same project and deploy as the agent logic. - Local embeddings eliminate a cost center.
bge-small-en-v1.5runs in-process via llama.cpp. No per-embedding billing, no embedding service SLA to worry about.
Apache 2.0 — see LICENSE