Giving agents the ability to remember - from conversation history to long-term knowledge
Perfect for:
- Multi-turn conversations requiring context
- Personalized experiences based on past interactions
- Agents that learn from user preferences
- Long-running tasks spanning multiple sessions
- Knowledge accumulation over time
Ideal scenarios:
- Personal assistants that remember your preferences
- Customer support bots that recall previous tickets
- Code assistants that learn your coding style
- Research agents that build knowledge graphs
- Tutoring systems that track learning progress
❌ Avoid when:
- Stateless is sufficient - One-off queries don't need history
- Privacy concerns - User data retention is problematic
- Cost constraints - Memory storage and retrieval adds expense
- Simple tasks - Calculator or weather lookup doesn't need memory
- Regulatory compliance - GDPR/CCPA may prohibit persistent storage
Cost trap: Infinite memory growth = infinite storage costs. Without pruning, costs compound over time.
What it is: Current conversation/task context held in the LLM prompt.
Characteristics:
- Lives only during current session
- Limited by context window (4K-200K tokens)
- Fast access (no external lookup)
- Cleared when session ends
Implementation:
```python
class ShortTermMemory:
    def __init__(self, max_tokens=4000):
        self.messages = []
        self.max_tokens = max_tokens

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        self._truncate_if_needed()

    def get_context(self):
        return self.messages

    def _truncate_if_needed(self):
        # Keep only recent messages if exceeding the limit
        # (rough heuristic: ~4 characters per token)
        total_tokens = sum(len(m["content"]) // 4 for m in self.messages)
        while total_tokens > self.max_tokens and len(self.messages) > 1:
            self.messages.pop(0)  # Remove oldest
            total_tokens = sum(len(m["content"]) // 4 for m in self.messages)
```
Use cases:
- Chatbot conversations
- Current task execution context
- Recent tool call results
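A quick usage sketch of `ShortTermMemory` (the class is repeated so the snippet runs standalone; the tiny `max_tokens` value exists only to make truncation visible):

```python
class ShortTermMemory:
    def __init__(self, max_tokens=4000):
        self.messages = []
        self.max_tokens = max_tokens

    def add(self, role, content):
        self.messages.append({"role": role, "content": content})
        self._truncate_if_needed()

    def get_context(self):
        return self.messages

    def _truncate_if_needed(self):
        # ~4 characters per token; drop oldest messages while over budget
        total = sum(len(m["content"]) // 4 for m in self.messages)
        while total > self.max_tokens and len(self.messages) > 1:
            self.messages.pop(0)
            total = sum(len(m["content"]) // 4 for m in self.messages)

memory = ShortTermMemory(max_tokens=20)  # tiny budget, just for the demo
memory.add("user", "x" * 40)             # ~10 tokens
memory.add("assistant", "y" * 40)        # ~10 tokens
memory.add("user", "z" * 40)             # total ~30 > 20, so oldest is dropped

context = memory.get_context()
print(len(context))          # 2
print(context[0]["role"])    # assistant
```

Note that truncation happens silently: callers always get a context that fits the budget, at the cost of losing the oldest turns.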
What it is: Information stored externally and retrieved when relevant.
Characteristics:
- Survives across sessions
- Unlimited capacity (external database)
- Requires retrieval mechanism
- Higher latency than short-term
Storage options:
- Key-value stores (Redis, DynamoDB)
- SQL databases (PostgreSQL, MySQL)
- Document stores (MongoDB)
- Vector databases (Pinecone, Weaviate, Chroma)
Implementation:
```python
from datetime import datetime

class LongTermMemory:
    def __init__(self, db_connection):
        self.db = db_connection

    def store(self, user_id, key, value, metadata=None):
        """Store a memory with optional metadata"""
        self.db.insert({
            "user_id": user_id,
            "key": key,
            "value": value,
            "metadata": metadata or {},
            "timestamp": datetime.now(),
            "access_count": 0
        })

    def retrieve(self, user_id, key):
        """Retrieve a specific memory"""
        memory = self.db.query({
            "user_id": user_id,
            "key": key
        }).first()
        if memory is None:
            return None
        # Track access for importance scoring
        self.db.update(memory["id"], {"access_count": memory["access_count"] + 1})
        return memory["value"]

    def search(self, user_id, query, limit=5):
        """Search memories by (case-insensitive) pattern match"""
        results = self.db.query({
            "user_id": user_id,
            "value": {"$regex": query, "$options": "i"}
        }).limit(limit)
        return [r["value"] for r in results]
```
Use cases:
- User preferences (theme, language, notification settings)
- Conversation history across sessions
- Facts learned about the user
- Past task outcomes
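To make the store/retrieve semantics concrete without standing up a database, here is a minimal dict-backed stand-in with the same interface (`InMemoryLongTerm` is a hypothetical name for illustration; a real deployment would use one of the stores listed above):

```python
from datetime import datetime

class InMemoryLongTerm:
    """Dict-backed stand-in for the LongTermMemory interface (illustration only)."""

    def __init__(self):
        self.records = {}  # (user_id, key) -> record

    def store(self, user_id, key, value, metadata=None):
        self.records[(user_id, key)] = {
            "value": value,
            "metadata": metadata or {},
            "timestamp": datetime.now(),
            "access_count": 0,
        }

    def retrieve(self, user_id, key):
        record = self.records.get((user_id, key))
        if record is None:
            return None
        record["access_count"] += 1  # track access for importance scoring
        return record["value"]

memory = InMemoryLongTerm()
memory.store("u1", "theme", "dark", metadata={"source": "settings"})
print(memory.retrieve("u1", "theme"))    # dark
print(memory.retrieve("u1", "missing"))  # None
```

The per-record `access_count` is what later scoring and pruning heuristics key off, so even a toy implementation should track it.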
What it is: Embeddings-based retrieval for similarity search.
Characteristics:
- Retrieves based on semantic similarity, not keywords
- Requires embedding model
- Fast approximate nearest neighbor search
- Handles large knowledge bases efficiently
How it works:
1. Convert text to embedding vector
2. Store vector in vector database
3. Query with new text → find similar vectors
4. Return associated text/metadata
Implementation:
```python
from datetime import datetime

from openai import OpenAI
import chromadb

class SemanticMemory:
    def __init__(self):
        self.client = OpenAI()
        self.chroma = chromadb.Client()
        self.collection = self.chroma.create_collection("memories")

    def embed(self, text):
        """Generate an embedding for text"""
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding

    def store(self, user_id, text, metadata=None):
        """Store a memory with its semantic embedding"""
        embedding = self.embed(text)
        self.collection.add(
            ids=[f"{user_id}_{datetime.now().timestamp()}"],
            embeddings=[embedding],
            documents=[text],
            metadatas=[{
                "user_id": user_id,
                **(metadata or {})
            }]
        )

    def retrieve(self, user_id, query, n_results=5):
        """Retrieve semantically similar memories"""
        query_embedding = self.embed(query)
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=n_results,
            where={"user_id": user_id}
        )
        return results["documents"][0] if results["documents"] else []
```
Use cases:
- RAG (Retrieval-Augmented Generation)
- Knowledge base search
- Finding relevant past conversations
- Document Q&A systems
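The "similarity, not keywords" point shows up even with hand-made toy vectors and plain cosine similarity (the three-dimensional "embeddings" below are invented for illustration; a real system would get them from an embedding model as shown above):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy "embeddings": texts with similar meaning get nearby vectors
docs = {
    "I enjoy coding in Python":  [0.9, 0.1, 0.0],
    "My favorite color is blue": [0.0, 0.2, 0.9],
    "Programming is fun":        [0.7, 0.5, 0.1],
}
# Pretend embedding of the query "What do I like to program in?"
query = [0.85, 0.2, 0.05]

# No keyword overlap with "Programming is fun", yet it still ranks above
# the color preference because its vector points the same way.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)
# ['I enjoy coding in Python', 'Programming is fun', 'My favorite color is blue']
```

Vector databases do exactly this ranking, just with approximate nearest-neighbor indexes instead of a full sort.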
What it is: Chronological record of specific events or interactions.
Characteristics:
- Time-ordered sequence
- Rich contextual metadata
- Can replay "what happened when"
- Supports temporal queries
Implementation:
```python
class EpisodicMemory:
    def __init__(self, db):
        self.db = db

    def record_event(self, user_id, event_type, data):
        """Record a timestamped event"""
        self.db.insert({
            "user_id": user_id,
            "event_type": event_type,
            "data": data,
            "timestamp": datetime.now(),
            "session_id": get_current_session_id()  # assumed session helper
        })

    def get_timeline(self, user_id, start_date, end_date):
        """Get events in a time range"""
        return self.db.query({
            "user_id": user_id,
            "timestamp": {"$gte": start_date, "$lte": end_date}
        }).sort("timestamp")

    def get_session(self, session_id):
        """Get all events from a specific session"""
        return self.db.query({
            "session_id": session_id
        }).sort("timestamp")
```
Use cases:
- "What did we discuss last Tuesday?"
- Debugging agent behavior
- Audit trails
- User activity timelines
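A temporal query like "what happened in the first week of March?" reduces to filtering and sorting by timestamp. A minimal in-memory sketch (the event data is invented; a real implementation would push this query down to the database, as `get_timeline` does above):

```python
from datetime import datetime

# In-memory event log (stand-in for the database used above)
events = [
    {"event_type": "chat",   "timestamp": datetime(2024, 3, 1, 10, 0),  "data": "discussed pricing"},
    {"event_type": "ticket", "timestamp": datetime(2024, 3, 5, 14, 30), "data": "filed bug report"},
    {"event_type": "chat",   "timestamp": datetime(2024, 3, 12, 9, 15), "data": "asked about refunds"},
]

def get_timeline(events, start, end):
    """Events within [start, end], oldest first."""
    hits = [e for e in events if start <= e["timestamp"] <= end]
    return sorted(hits, key=lambda e: e["timestamp"])

week = get_timeline(events, datetime(2024, 3, 1), datetime(2024, 3, 7))
print([e["data"] for e in week])
# ['discussed pricing', 'filed bug report']
```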
Combine short-term + long-term:
```python
class HybridMemory:
    def __init__(self):
        # Assumes a module-level `db` connection, as in the examples above
        self.short_term = ShortTermMemory(max_tokens=4000)
        self.long_term = LongTermMemory(db)
        self.semantic = SemanticMemory()

    def add_message(self, user_id, role, content):
        # Always add to short-term (current context)
        self.short_term.add(role, content)
        # Selectively store important messages long-term
        if self._is_important(content):
            self.long_term.store(user_id, f"{role}_{content}", content)
            self.semantic.store(user_id, content, {"role": role})

    def get_context(self, user_id, query=None):
        # Start with short-term memory
        context = self.short_term.get_context()
        # Augment with relevant long-term memories
        if query:
            relevant_memories = self.semantic.retrieve(user_id, query, n_results=3)
            # _merge_context (not shown) interleaves and deduplicates
            context = self._merge_context(context, relevant_memories)
        return context

    def _is_important(self, content):
        # Simple heuristic: longer messages or questions
        return len(content) > 50 or "?" in content
```
Organize by levels of abstraction:
Level 1 (Raw): Individual messages
Level 2 (Summaries): Conversation summaries
Level 3 (Facts): Extracted key facts
Level 4 (Knowledge): Structured knowledge graph
```python
class HierarchicalMemory:
    def process_conversation(self, messages):
        # Level 1: Store raw messages
        for msg in messages:
            self.store_raw(msg)
        # Level 2: Generate a summary
        summary = self.llm.generate(f"Summarize: {messages}")
        self.store_summary(summary)
        # Level 3: Extract facts
        facts = self.llm.generate(f"Extract key facts from: {summary}")
        self.store_facts(facts)
        # Level 4: Update the knowledge graph
        # (store_* and update_knowledge_graph are persistence helpers, not shown)
        self.update_knowledge_graph(facts)
```
Compress old memories to save space:
```python
from datetime import datetime, timedelta

def consolidate_old_memories(user_id, older_than_days=30):
    """
    Consolidate old detailed memories into summaries
    """
    cutoff_date = datetime.now() - timedelta(days=older_than_days)
    old_memories = db.query({
        "user_id": user_id,
        "timestamp": {"$lt": cutoff_date},
        "consolidated": False
    })
    # Group by week (group_by_week is an assumed helper)
    for week_memories in group_by_week(old_memories):
        # Summarize the week
        summary = llm.generate(
            f"Summarize these interactions:\n{week_memories}"
        )
        # Replace detailed memories with the summary
        db.insert({
            "user_id": user_id,
            "type": "weekly_summary",
            "content": summary,
            "original_count": len(week_memories),
            "timestamp": week_memories[0]["timestamp"]
        })
        # Mark originals as consolidated (or delete them)
        db.update_many(
            {"id": {"$in": [m["id"] for m in week_memories]}},
            {"consolidated": True}
        )
```
| Memory Type | Retrieval Speed | Capacity | Accuracy | Cost | Use Case |
|---|---|---|---|---|---|
| Short-term | Instant | Limited (context window) | Perfect | $ | Current conversation |
| Long-term (SQL) | Fast | Large | Exact match | $$ | User preferences |
| Semantic (Vector) | Fast | Very large | Approximate | $$$ | Knowledge search |
| Episodic | Medium | Large | Perfect | $$ | Event timelines |
Scenario: Chatbot with 1,000 daily active users, averaging 20 messages per user per day.

Short-term only:
- Storage: none (held in the prompt)
- Cost: included in LLM token cost
- Total: $0

Long-term (SQL):
- Storage: 1,000 users × 20 msgs/day × 500 bytes = 10 MB/day ≈ 300 MB/month
- Cost: ~$0.23/month (e.g. AWS RDS)
- Queries: ~100K/month ≈ $0.10
- Total: ~$0.33/month

Semantic (vector):
- Embeddings: 20K msgs/day × 100 tokens/msg × $0.0001/1K tokens = $0.20/day = $6/month
- Storage: 1M vectors × $0.096/1M vectors/month = $0.096/month
- Queries: 20K/day × $0.20/1M queries ≈ $0.004/day ≈ $0.12/month
- Total: ~$6.22/month

Grand total: ~$6.55/month for the full memory system

Optimization:
- Embed only important messages (≈80% reduction) → ~$1.50/month
- Use cheaper embedding models
- Implement memory pruning
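The arithmetic above can be folded into a small estimator so the numbers are easy to re-run for other scenarios (the default prices are the example rates assumed in this scenario, not current vendor pricing):

```python
def estimate_monthly_cost(users, msgs_per_user_day,
                          tokens_per_msg=100,
                          embed_price_per_1k_tokens=0.0001,
                          sql_monthly_usd=0.33,
                          vector_monthly_usd=0.22):
    """Rough monthly cost estimate for the hybrid memory scenario above."""
    msgs_per_day = users * msgs_per_user_day
    # Embedding cost: tokens embedded per day, priced per 1K tokens
    embed_daily = msgs_per_day * tokens_per_msg / 1000 * embed_price_per_1k_tokens
    embeddings_usd = embed_daily * 30
    return {
        "sql": sql_monthly_usd,
        "vector": vector_monthly_usd,
        "embeddings": round(embeddings_usd, 2),
        "total": round(sql_monthly_usd + vector_monthly_usd + embeddings_usd, 2),
    }

print(estimate_monthly_cost(users=1000, msgs_per_user_day=20))
# {'sql': 0.33, 'vector': 0.22, 'embeddings': 6.0, 'total': 6.55}
```

Embedding cost dominates and scales linearly with message volume, which is why the first optimization above (embed only important messages) has the biggest payoff.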
Problem: Never delete old memories
```python
# ❌ Unbounded growth
for message in user_messages:
    memory.store(message)  # Forever
```
Solution: Implement retention policies
```python
# ✅ Auto-prune old memories
def prune_old_memories(user_id, max_age_days=90):
    cutoff = datetime.now() - timedelta(days=max_age_days)
    db.delete_many({
        "user_id": user_id,
        "timestamp": {"$lt": cutoff},
        "importance": {"$lt": 0.5}  # Keep important ones
    })
```
Problem: Retrieve irrelevant memories
```python
# ❌ Returns random old memories
memories = db.query({"user_id": user_id}).limit(5)
```
Solution: Score by relevance + recency
```python
# ✅ Prioritize relevant + recent
def get_relevant_memories(user_id, query):
    # Semantic similarity
    semantic_matches = semantic_memory.retrieve(user_id, query)
    # Recent memories
    recent = db.query({
        "user_id": user_id,
        "timestamp": {"$gte": datetime.now() - timedelta(days=7)}
    })
    # Combine with a weighted score (similarity_score, recency_score, and
    # access_count are assumed to be normalized to [0, 1] by the retrieval layer)
    scored = []
    for memory in semantic_matches + recent:
        score = (
            memory.similarity_score * 0.6 +
            memory.recency_score * 0.3 +
            memory.access_count * 0.1
        )
        scored.append((score, memory))
    # Sort by score only; comparing memory objects on ties would raise
    return [m for _, m in sorted(scored, key=lambda s: s[0], reverse=True)[:5]]
```
Problem: Too many memories → exceed context limit
```python
# ❌ Blindly include all memories
context = short_term.get() + long_term.get_all()
# 50K tokens → exceeds the 32K limit!
```
Solution: Budget token allocation
```python
# ✅ Token-aware context building
def build_context(user_id, query, max_tokens=4000):
    # Reserve tokens for each component
    system_tokens = 200
    query_tokens = len(query) // 4  # rough ~4 chars/token estimate
    response_buffer = 1000
    available_for_memory = max_tokens - system_tokens - query_tokens - response_buffer
    # Get memories in priority order
    memories = get_relevant_memories(user_id, query)
    # Add until the token budget is exhausted
    context = []
    tokens_used = 0
    for memory in memories:
        memory_tokens = len(memory.content) // 4
        if tokens_used + memory_tokens <= available_for_memory:
            context.append(memory)
            tokens_used += memory_tokens
        else:
            break
    return context
```
Problem: Storing sensitive data indefinitely
```python
# ❌ Storing credit card numbers, passwords
memory.store(user_id, "password", user_input)
```
Solution: Filter sensitive data
```python
# ✅ PII detection and redaction
import re

def safe_store(user_id, content):
    # Detect sensitive patterns
    # (redact_pii, is_sensitive_category, encrypt are assumed helpers)
    if contains_pii(content):
        content = redact_pii(content)
    if is_sensitive_category(content):
        # Don't store at all, or encrypt
        content = encrypt(content)
    memory.store(user_id, content)

def contains_pii(text):
    patterns = [
        r'\d{3}-\d{2}-\d{4}',                    # SSN
        r'\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}',  # Credit card
        r'password|passwd|pwd'                   # Password keywords
    ]
    return any(re.search(pattern, text, re.I) for pattern in patterns)
```
Problem: Outdated facts persist
```python
memory.store("user_preference", "User likes Python")
# 6 months later the user now prefers Rust,
# but the old preference is still retrieved
```
Solution: Version facts or timestamp + TTL
```python
def store_fact(user_id, fact_type, value):
    # Invalidate the old version
    db.update_many(
        {"user_id": user_id, "fact_type": fact_type},
        {"valid": False}
    )
    # Store the new version
    db.insert({
        "user_id": user_id,
        "fact_type": fact_type,
        "value": value,
        "valid": True,
        "version": get_next_version(),
        "timestamp": datetime.now()
    })
```
Problem: False memories accumulate
```python
# User says something, the agent stores it as fact
# user:  "I think Paris is in Germany"
# agent:
memory.store("Paris is in Germany")  # ❌ Storing misinformation
```
Solution: Fact verification before storage
```python
def store_with_verification(user_id, claim):
    # Verify against a knowledge base (fact_check_api is an assumed service)
    verified = fact_check_api.verify(claim)
    if verified.confidence > 0.8:
        memory.store(user_id, claim, metadata={"verified": True})
    else:
        # Store with a low-confidence flag
        memory.store(user_id, claim, metadata={
            "verified": False,
            "confidence": verified.confidence
        })
```
Agent periodically reviews and consolidates memories:
```python
from datetime import datetime, timedelta

def reflect_on_memories(user_id):
    """
    Periodic reflection to extract higher-level insights
    """
    recent_memories = db.query({
        "user_id": user_id,
        "timestamp": {"$gte": datetime.now() - timedelta(days=7)}
    })
    # Generate insights
    insights = llm.generate(f"""
    Analyze these interactions and extract key insights about the user:
    {recent_memories}
    What patterns do you notice?
    What are their preferences?
    What goals are they trying to achieve?
    """)
    # Store insights as meta-memory
    memory.store(user_id, "weekly_insights", insights, metadata={
        "type": "reflection",
        "timestamp": datetime.now()
    })
```
Mimic human memory decay:
```python
import math
from datetime import datetime

def calculate_memory_strength(memory):
    """
    Memory strength decays over time but is reinforced by access
    """
    age_days = (datetime.now() - memory["timestamp"]).days
    # Exponential decay: strength falls to 1/e after 30 days
    decay = math.exp(-age_days / 30)
    # Reinforcement from repeated access
    reinforcement = 1 + (memory["access_count"] * 0.1)
    # Importance boost from stored metadata
    importance_boost = memory["metadata"].get("importance", 0.5)
    return decay * reinforcement * importance_boost

def prune_weak_memories(user_id, threshold=0.1):
    """Remove memories below the strength threshold"""
    memories = db.query({"user_id": user_id})
    for memory in memories:
        if calculate_memory_strength(memory) < threshold:
            db.delete(memory["id"])
```
Create indexes for fast retrieval:
```python
db.create_index([
    ("user_id", 1),
    ("timestamp", -1)
])
db.create_index([
    ("user_id", 1),
    ("fact_type", 1),
    ("valid", 1)
])
```
Back up memories to object storage:
```python
import json
from datetime import datetime

def backup_user_memories(user_id):
    memories = db.query({"user_id": user_id})
    backup_data = {
        "user_id": user_id,
        "backup_date": datetime.now().isoformat(),
        "memories": [m.to_dict() for m in memories]
    }
    # Store in object storage (S3, GCS)
    storage.upload(
        f"memory_backups/{user_id}/{datetime.now().date()}.json",
        json.dumps(backup_data)
    )
```
Monitor memory usage per user:
```python
def analyze_memory_usage(user_id):
    return {
        "total_memories": db.count({"user_id": user_id}),
        "storage_mb": calculate_storage_size(user_id),
        "avg_access_count": db.aggregate([
            {"$match": {"user_id": user_id}},
            {"$group": {"_id": None, "avg": {"$avg": "$access_count"}}}
        ]),
        "oldest_memory": db.query({"user_id": user_id}).sort("timestamp").first(),
        "most_accessed": db.query({"user_id": user_id}).sort("access_count", -1).limit(10)
    }
```
Test memory behavior:
```python
from datetime import datetime, timedelta

def test_memory_retrieval():
    memory = HybridMemory()
    user_id = "test_user"
    # Store test memories
    memory.add_message(user_id, "user", "I love Python programming")
    memory.add_message(user_id, "user", "My favorite color is blue")
    # Test retrieval
    context = memory.get_context(user_id, "What language do I like?")
    assert "Python" in str(context)
    assert len(context) > 0

def test_memory_consolidation():
    # Create old memories (assumes a store() variant that accepts a
    # backdated timestamp)
    for i in range(100):
        memory.store(user_id, f"message_{i}", f"content {i}",
                     timestamp=datetime.now() - timedelta(days=60))
    # Consolidate
    consolidate_old_memories(user_id, older_than_days=30)
    # Verify: originals consolidated, weekly summaries created
    remaining = db.count({"user_id": user_id, "consolidated": False})
    summaries = db.count({"user_id": user_id, "type": "weekly_summary"})
    assert remaining < 100
    assert summaries > 0
```
- Vector Databases Comparison: Benchmark
- MemGPT: Paper - OS-inspired memory management
- Memory Networks: Paper - Neural memory architectures
- Pinecone Docs: Vector DB Guide
- Chroma: Open-source vector DB
- Need error handling? → See Error Handling
- Cost concerns? → See Cost Optimization
- Security? → See Security
- Testing? → See Testing Strategies