You are building FastMemoryWriteEnv.
This project is important. Do not turn it into a toy benchmark.
The motive is:
Create a production-shaped streaming memory-write environment that helps improve how fast incoming data becomes useful memory while data keeps coming continuously.
The core is the memory write path.
This project tests whether an LLM can decide:
- what to store
- what to update
- what to ignore
- what to mark stale
- what to compress
- what to index now
- what to delay
under latency, storage, and indexing constraints.
The environment scores whether those decisions make future queries answerable quickly and correctly.
Do not build this as:
- generic RAG
- long-document QA
- static document retrieval
- a comparison of many hand-written rule policies
- a toy local-only demo
- a pure benchmark with no production-shaped pipeline
Do not add many rule-based strategy policies.
The main system is one LLM memory-write policy.
The main policy is:
LLMMemoryWritePolicy
It receives:
- new raw event
- current active memories
- recent event context
- latency budget
- storage budget
- indexing budget
It outputs validated structured actions:
- write_memory
- update_memory
- mark_stale
- ignore_event
- compress_memory
- index_now
- delay_index
The environment executes the actions.
The LLM policy must not directly mutate stores or call Pinecone.
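A minimal sketch of what a validated action schema could look like, assuming a single Pydantic model with a constrained `action_type`. The field names `memory_id`, `content`, and `reason`, and the helper `validate_action`, are illustrative assumptions and not part of this spec. The policy only produces data here; executing it stays with the environment.

```python
from typing import Any, Dict, Literal, Optional, Tuple

from pydantic import BaseModel, ValidationError

# The seven action names listed above; anything else is rejected.
ActionType = Literal[
    "write_memory",
    "update_memory",
    "mark_stale",
    "ignore_event",
    "compress_memory",
    "index_now",
    "delay_index",
]


class MemoryAction(BaseModel):
    """One structured action proposed by the policy (fields are assumptions)."""

    action_type: ActionType
    memory_id: Optional[str] = None
    content: Optional[str] = None
    reason: Optional[str] = None


def validate_action(raw: Dict[str, Any]) -> Tuple[Optional[MemoryAction], Optional[str]]:
    """Return (action, None) on success or (None, error message) on failure."""
    try:
        return MemoryAction(**raw), None
    except ValidationError as exc:
        # The error is returned to the caller, not swallowed.
        return None, str(exc)
```

An action such as `{"action_type": "call_pinecone"}` fails validation, so it never reaches the stores or the index.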
Allowed baselines:
- NoMemoryBaseline
- StoreEverythingBaseline
- OraclePolicy
Do not create multiple hand-written memory strategies as the main result.
Rules are only sanity checks or bounds.
Continuous data stream
↓
Fast raw write
↓
Memory-write queue
↓
LLM memory-write policy
↓
Validated memory actions
↓
Memory store + Pinecone index
↓
Future queries arrive while data is still coming
↓
Retrieve memory + answer
↓
Environment scores speed + usefulness + freshness + storage cost
Build the architecture like a real system.
Use:
- typed Pydantic schemas
- clean module boundaries
- SQLite raw/memory stores
- Pinecone as the real retrieval backend
- in-memory fake only for unit tests
- structured logging
- latency accounting
- raw rollout logs
- reproducible evaluation scripts
Pinecone is the real backend.
Use environment variables:
- PINECONE_API_KEY
- PINECONE_INDEX_NAME
- PINECONE_CLOUD
- PINECONE_REGION
InMemoryIndex exists only for unit tests.
Do not make the real path local-only.
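One possible way to load these variables in config.py, sketched with the standard library only. `PineconeSettings`, `ConfigError`, and `load_pinecone_settings` are hypothetical names, and the sketch deliberately makes no Pinecone client calls. It fails loudly when a variable is missing so the real path cannot quietly fall back to a local index.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED_VARS = (
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
    "PINECONE_CLOUD",
    "PINECONE_REGION",
)


class ConfigError(RuntimeError):
    """Raised when required configuration is missing."""


@dataclass(frozen=True)
class PineconeSettings:
    api_key: str
    index_name: str
    cloud: str
    region: str


def load_pinecone_settings(env: Optional[Mapping[str, str]] = None) -> PineconeSettings:
    """Read settings from the given mapping, defaulting to os.environ."""
    source = os.environ if env is None else env
    # Treat empty strings as missing too.
    missing = [name for name in REQUIRED_VARS if not source.get(name)]
    if missing:
        raise ConfigError("Missing required environment variables: " + ", ".join(missing))
    return PineconeSettings(
        api_key=source["PINECONE_API_KEY"],
        index_name=source["PINECONE_INDEX_NAME"],
        cloud=source["PINECONE_CLOUD"],
        region=source["PINECONE_REGION"],
    )
```

Passing a plain dict as `env` keeps unit tests independent of the process environment.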
Use an internal abstraction:
LLMClient
Implement:
- MockLLMClient
- OpenAICompatibleLLMClient
The LLM policy should call this abstraction rather than hardcoding provider logic throughout the codebase.
The LLM must return structured JSON. Validate all outputs with Pydantic action schemas. Add retry/repair for invalid JSON.
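The abstraction and the retry/repair loop might be sketched as follows. The method name `complete`, the exception `InvalidLLMOutput`, the helper `request_json_actions`, and the wording of the repair prompt are all assumptions made for illustration. This sketch covers only JSON parsing; each returned item would still need to pass the Pydantic action schemas.

```python
import json
from typing import Any, Dict, List, Protocol


class LLMClient(Protocol):
    """Provider-agnostic interface the policy depends on (assumed shape)."""

    def complete(self, prompt: str) -> str: ...


class MockLLMClient:
    """Returns scripted responses in order and records every prompt it sees."""

    def __init__(self, responses: List[str]) -> None:
        self._responses = list(responses)
        self.prompts: List[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if not self._responses:
            raise RuntimeError("MockLLMClient has no scripted responses left")
        return self._responses.pop(0)


class InvalidLLMOutput(Exception):
    """Raised when no attempt produced a JSON list of actions."""


def request_json_actions(
    client: LLMClient, prompt: str, max_attempts: int = 3
) -> List[Dict[str, Any]]:
    """Ask for a JSON list of actions, re-prompting with the error on failure."""
    current_prompt = prompt
    last_error = "no attempts made"
    for _ in range(max_attempts):
        text = client.complete(current_prompt)
        try:
            data = json.loads(text)
            if not isinstance(data, list):
                raise ValueError("expected a JSON list of actions")
            return data
        except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
            last_error = str(exc)
            current_prompt = (
                f"{prompt}\n\nYour previous reply was invalid ({last_error}). "
                "Reply with a JSON list of actions only."
            )
    raise InvalidLLMOutput(f"no valid JSON after {max_attempts} attempts: {last_error}")
```

Because the policy sees only `LLMClient`, swapping `MockLLMClient` for `OpenAICompatibleLLMClient` should not require changes to policy code.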
The dataset must simulate streaming data.
Events should arrive over time.
Queries should arrive while events are still coming.
Include:
- useful facts
- low-value noise
- duplicates
- contradictions
- stale updates
- urgent facts
- multiple entities/users
- far-apart evidence
- storage/indexing pressure
Dataset modes:
- small
- medium
- long
Small is for tests. Long is for serious runs.
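A rough sketch of a seeded stream generator that interleaves queries with events. The mode sizes, category names, `StreamItem` fields, and the `query_every` parameter are placeholders chosen for illustration, not values from this spec. A real generator would also need to produce gold facts, far-apart evidence, and budget pressure.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical event counts per mode.
MODE_SIZES = {"small": 20, "medium": 200, "long": 2000}

CATEGORIES = [
    "useful_fact", "noise", "duplicate",
    "contradiction", "stale_update", "urgent",
]


@dataclass(frozen=True)
class StreamItem:
    timestamp_ms: int
    kind: str  # "event" or "query"
    entity_id: str
    category: str  # one of CATEGORIES, or "query"
    text: str


def generate_stream(mode: str, seed: int, query_every: int = 5) -> List[StreamItem]:
    """Build a time-ordered stream where queries arrive between events."""
    if mode not in MODE_SIZES:
        raise ValueError(f"unknown dataset mode: {mode!r}")
    rng = random.Random(seed)  # local RNG keeps generation deterministic
    items: List[StreamItem] = []
    now_ms = 0
    for i in range(MODE_SIZES[mode]):
        now_ms += rng.randint(10, 500)
        entity = f"user_{rng.randint(1, 5)}"
        category = rng.choice(CATEGORIES)
        items.append(
            StreamItem(now_ms, "event", entity, category, f"{category} #{i} about {entity}")
        )
        if (i + 1) % query_every == 0:
            # A query lands shortly after the event, while the stream continues.
            now_ms += rng.randint(1, 50)
            items.append(
                StreamItem(now_ms, "query", entity, "query", f"What is the latest known state of {entity}?")
            )
    return items
```

The same `(mode, seed)` pair always yields the same stream, which keeps evaluation runs reproducible.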
The primary metric is:
time_to_useful_memory
It means:
event arrives
-> raw event written
-> useful memory written/updated
-> memory indexed
-> memory retrieved by future query
-> answer uses it correctly
Track the breakdown:
- time_to_raw_write
- time_to_memory_write
- time_to_indexed_memory
- time_to_retrieved_memory
- time_to_useful_memory
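The breakdown could be derived from per-stage timestamps, as in this sketch. It assumes every metric is measured from event arrival and that a stage not yet reached reports `None`; both are interpretations, not requirements stated here. `MemoryTimeline` and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class MemoryTimeline:
    """Timestamps (ms) for one event's path through the pipeline."""

    event_arrived_ms: float
    raw_written_ms: Optional[float] = None
    memory_written_ms: Optional[float] = None
    indexed_ms: Optional[float] = None
    retrieved_ms: Optional[float] = None
    used_correctly_ms: Optional[float] = None

    def _since_arrival(self, stamp: Optional[float]) -> Optional[float]:
        # A stage that has not happened yet has no duration.
        return None if stamp is None else stamp - self.event_arrived_ms

    def breakdown(self) -> Dict[str, Optional[float]]:
        return {
            "time_to_raw_write": self._since_arrival(self.raw_written_ms),
            "time_to_memory_write": self._since_arrival(self.memory_written_ms),
            "time_to_indexed_memory": self._since_arrival(self.indexed_ms),
            "time_to_retrieved_memory": self._since_arrival(self.retrieved_ms),
            "time_to_useful_memory": self._since_arrival(self.used_correctly_ms),
        }
```

Keeping unreached stages as `None` rather than zero avoids counting memories that never became useful as instantly useful.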
Score answers on both evidence and facts.
answer_correct = answer satisfies required gold facts
evidence_correct = cited evidence supports answer
answer_success = answer_correct AND evidence_correct
Do not use exact string match as primary scoring.
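The three definitions above can be sketched as set checks over identifiers. This assumes an upstream step (not shown) has already mapped the answer to the fact IDs it asserts and the event IDs it cites, so no raw answer strings are compared. The function name, parameter names, and the rule that at least one citation is required are assumptions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AnswerScore:
    answer_correct: bool
    evidence_correct: bool

    @property
    def answer_success(self) -> bool:
        # answer_success = answer_correct AND evidence_correct
        return self.answer_correct and self.evidence_correct


def score_answer(
    required_fact_ids: Iterable[str],
    asserted_fact_ids: Iterable[str],
    cited_event_ids: Iterable[str],
    supporting_event_ids: Iterable[str],
) -> AnswerScore:
    """Score one answer against gold facts and gold supporting evidence."""
    # Every required gold fact must be asserted by the answer.
    answer_correct = set(required_fact_ids) <= set(asserted_fact_ids)
    cited = set(cited_event_ids)
    # Assumed rule: at least one citation, and every citation is gold evidence.
    evidence_correct = bool(cited) and cited <= set(supporting_event_ids)
    return AnswerScore(answer_correct=answer_correct, evidence_correct=evidence_correct)
```

Under this rule, an answer that states the right facts but cites nothing has `answer_correct` true and `answer_success` false.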
Use this structure:
fast_memory_write_env/
    __init__.py
    actions.py
    schemas.py
    state.py
    env.py
    stores.py
    index.py
    pinecone_index.py
    in_memory_index.py
    llm_client.py
    policies.py
    rewards.py
    metrics.py
    dataset.py
    evaluator.py
    config.py
Scripts:
scripts/
    generate_dataset.py
    run_llm_policy.py
    run_eval.py
    evaluate_results.py
Tests:
tests/
    test_actions.py
    test_dataset.py
    test_env_transitions.py
    test_stores.py
    test_in_memory_index.py
    test_llm_policy_mock.py
    test_metrics.py
    test_rewards.py
Phase 1. Implement:
- schemas.py
- actions.py
- state.py
- dataset.py
- tests for schemas/dataset
Do not implement Pinecone yet.
Phase 2. Implement:
- stores.py
- index.py
- pinecone_index.py
- in_memory_index.py
- env.py
- config.py
- transition tests
Pinecone is the real backend. InMemoryIndex is a test fake only.
Phase 3. Implement:
- llm_client.py
- LLMMemoryWritePolicy
- MockLLMClient
- OpenAICompatibleLLMClient
- run_llm_policy.py
- tests using mock client
Do not add multiple rule policies.
Phase 4. Implement:
- rewards.py
- metrics.py
- evaluator.py
- run_eval.py
- evaluate_results.py
Focus on time-to-useful-memory.
Phase 5. Implement:
- README.md
- examples
- raw rollout logs
- metrics output
- eval summary
Every phase must keep tests passing.
Required command:
pytest -q
Every real run should produce:
results/raw_rollouts.jsonl
results/metrics.csv
results/eval_summary.json
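A small sketch of writing the three output files with the standard library. The function name and the shape of the records (plain dicts) are assumptions. The actual columns and summary keys would come from metrics.py and evaluator.py.

```python
import csv
import json
from pathlib import Path
from typing import Any, Dict, List


def write_run_outputs(
    results_dir: str,
    rollouts: List[Dict[str, Any]],
    metrics_rows: List[Dict[str, Any]],
    summary: Dict[str, Any],
) -> None:
    """Write raw_rollouts.jsonl, metrics.csv, and eval_summary.json."""
    out = Path(results_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One JSON object per line, so long runs can be processed line by line.
    with (out / "raw_rollouts.jsonl").open("w", encoding="utf-8") as f:
        for record in rollouts:
            f.write(json.dumps(record, sort_keys=True) + "\n")

    # Sorted union of keys gives a stable column order across runs.
    fieldnames = sorted({key for row in metrics_rows for key in row})
    with (out / "metrics.csv").open("w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(metrics_rows)

    with (out / "eval_summary.json").open("w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2, sort_keys=True)
```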
No hidden state. No unvalidated LLM actions. No vague policy behavior. No toy framing.
Use:
- type hints
- Pydantic models
- dataclasses only when appropriate
- small clear modules
- deterministic seeds for dataset generation
- structured logs
- explicit error handling
Do not silently swallow failures.
Action execution should return structured results:
- success
- action_type
- latency_ms
- storage_tokens_delta
- error
- payload
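A sketch of how the environment might wrap action execution so that every outcome, including a failure, comes back as a structured result with these six fields. It is shown as a frozen dataclass to stay dependency-free; a Pydantic model with the same fields would fit just as well. The `run_action` helper and the handler's `(storage_tokens_delta, payload)` return convention are assumptions.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class ActionResult:
    success: bool
    action_type: str
    latency_ms: float
    storage_tokens_delta: int
    error: Optional[str] = None
    payload: Dict[str, Any] = field(default_factory=dict)


def run_action(
    action_type: str,
    handler: Callable[[], Tuple[int, Dict[str, Any]]],
) -> ActionResult:
    """Time a store/index handler and report its outcome as an ActionResult."""
    start = time.perf_counter()
    try:
        storage_tokens_delta, payload = handler()
    except Exception as exc:
        # Failures are surfaced in the result instead of being swallowed.
        latency_ms = (time.perf_counter() - start) * 1000.0
        return ActionResult(
            success=False,
            action_type=action_type,
            latency_ms=latency_ms,
            storage_tokens_delta=0,
            error=f"{type(exc).__name__}: {exc}",
        )
    latency_ms = (time.perf_counter() - start) * 1000.0
    return ActionResult(
        success=True,
        action_type=action_type,
        latency_ms=latency_ms,
        storage_tokens_delta=storage_tokens_delta,
        payload=payload,
    )
```

For example, `run_action("write_memory", lambda: (12, {"memory_id": "m1"}))` returns a successful result with a `storage_tokens_delta` of 12, while a handler that raises yields `success=False` with the exception type and message in `error`.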
The environment is not smart by itself.
The LLM policy decides memory actions.
The environment:
- provides events
- validates actions
- updates stores
- manages index
- serves queries
- records timings
- scores whether memory became useful fast enough
The final README should describe the project as:
A production-shaped streaming memory-write environment for evaluating how quickly and accurately an LLM policy can turn continuous incoming data into useful searchable memory.
Do not describe it as:
- A RAG benchmark.
- A comparison of hand-written memory strategies.
- A simple document QA environment.