A self-improving RAG system for legal questions against a Victorian bench book. An autonomous agent loop evaluates retrieval quality using Arize, then iteratively improves the LangGraph agent and indexing pipeline until recall targets are met.
agent/ LangGraph RAG agent (OpenSearch retriever + GPT-4o-mini)
index/ OpenSearch indexing pipeline (chunking, embedding, bulk indexing)
scripts/ralph/ Autonomous improvement loop (ralph.sh + Claude Code)
scripts/ Evaluation and re-indexing scripts
skills/ Skill definitions for the Ralph agent (PRD generation, story execution)
- Arize account — sign up at arize.com (free tier available)
- Python 3.10+ and uv
- Claude Code: install via

      npm install -g @anthropic-ai/claude-code

- Arize AX CLI: find your Space ID and API key in the Arize UI under Settings > Space Settings > API Keys.

      pip install arize-ax-cli
      ax config set --space-id <your-space-id> --api-key <your-arize-api-key>
      ax config show  # verify profile

- Arize Skills plugin for Claude Code:

      claude /plugin marketplace add Arize-ai/arize-skills
      claude /plugin install arize-skills@Arize-ai-arize-skills

- OpenSearch instance with kNN enabled
- API keys — OpenAI or Anthropic, Arize, and OpenSearch credentials (see Environment Variables)
git clone <repo-url> && cd self-rag-2
# Agent
cd agent && uv sync --dev && cd ..
# Index pipeline
cd index && uv sync --dev && cd ..

cp agent/.env.example agent/.env
cp index/.env.example index/.env
# Edit both .env files with your credentials

Inside Claude Code, use arize-skills to download the qa split from isaacus/legal-rag-bench and upload it as an Arize dataset. The self-improvement loop uses this dataset to evaluate retrieval recall.

cd index
python index.py

This loads the legal-rag-bench corpus, chunks it, generates embeddings with text-embedding-3-large (1024 dims), and bulk-indexes them into OpenSearch.

cd agent
langgraph dev

The agent runs on port 2024.

Start Claude Code with --dangerously-skip-permissions so the autonomous agent can freely edit code and run commands:
./scripts/ralph/ralph.sh --tool claude

Ralph drives the improvement loop:
- Reads a PRD (scripts/ralph/prd.json) and picks the highest-priority failing user story
- Implements changes in agent/ (retrieval logic) and/or index/ (indexing pipeline)
- Runs quality checks (lint, typecheck, tests)
- Commits passing changes
- Runs an Arize experiment against the QA dataset to measure recall@1, recall@5, recall@10
- Analyzes failures and adds new improvement stories if recall@5 < 80%
- Repeats until recall targets are met or max iterations reached
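
The recall figures in the loop above can be illustrated with a small sketch. This is not the repository's evaluator (that runs as an Arize experiment); it assumes each QA example carries a set of gold passage IDs and that the retriever returns a ranked list of IDs. The function names and the sample IDs are hypothetical.

```python
from collections.abc import Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant passages found in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0  # convention chosen for this sketch: no gold passages scores zero
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)


def mean_recall_at_k(
    examples: Sequence[tuple[Sequence[str], set[str]]], k: int
) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not examples:
        return 0.0
    return sum(recall_at_k(r, g, k) for r, g in examples) / len(examples)


def needs_new_stories(
    examples: Sequence[tuple[Sequence[str], set[str]]], threshold: float = 0.80
) -> bool:
    """Mirror the loop's rule: keep adding stories while mean recall@5 < 80%."""
    return mean_recall_at_k(examples, 5) < threshold


if __name__ == "__main__":
    # Hypothetical passage IDs, for illustration only.
    examples = [
        (["p3", "p1", "p9"], {"p1"}),  # gold passage retrieved at rank 2
        (["p4", "p7", "p2"], {"p8"}),  # gold passage not retrieved
    ]
    for k in (1, 5, 10):
        print(f"recall@{k} = {mean_recall_at_k(examples, k):.2f}")
    print("add new stories:", needs_new_stories(examples))
```

Averaging per-question recall keeps the metric comparable across runs even if some questions have more than one gold passage.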
A LangGraph StateGraph with two nodes:
- retrieve: kNN search against OpenSearch using text-embedding-3-large (1024 dims)
- call_model: RAG prompt answered by GPT-4o-mini
Tracing via Arize OTel + LangChainInstrumentor.
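
As a rough illustration of what the retrieve node sends to OpenSearch, the sketch below builds a k-NN query body. The vector field name `embedding` and the helper name are assumptions, not taken from the repository; the query vector is assumed to come from text-embedding-3-large requested at 1024 dimensions, matching the index.

```python
EMBEDDING_DIMS = 1024  # matches the dimensionality stated for the index


def build_knn_query(
    query_vector: list[float], k: int = 10, field: str = "embedding"
) -> dict:
    """Build an OpenSearch k-NN search body for a single query vector.

    `field` is a hypothetical name for the knn_vector field in the index.
    """
    if len(query_vector) != EMBEDDING_DIMS:
        raise ValueError(
            f"expected {EMBEDDING_DIMS} dimensions, got {len(query_vector)}"
        )
    return {
        "size": k,  # number of hits returned to the caller
        "query": {"knn": {field: {"vector": query_vector, "k": k}}},
    }


if __name__ == "__main__":
    body = build_knn_query([0.0] * EMBEDDING_DIMS, k=5)
    print(body["size"], list(body["query"]["knn"]))
```

The resulting dict can be passed as the `body` of a search request with an OpenSearch client; checking the vector length up front catches a mismatch between query and index dimensions before the request is sent.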
Loads the legal-rag-bench corpus, chunks with RecursiveCharacterTextSplitter, embeds with text-embedding-3-large (1024 dims), and bulk-indexes into OpenSearch with HNSW cosine similarity.
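
An index configured for HNSW with cosine similarity might be created with a body like the one sketched here. The field names `text` and `embedding`, the helper name, and the choice of the `lucene` engine are assumptions; the repository's actual mapping may differ.

```python
def build_index_body(dims: int = 1024, vector_field: str = "embedding") -> dict:
    """Return settings and mappings for a k-NN enabled OpenSearch index.

    Field names and the engine choice are illustrative assumptions.
    """
    return {
        "settings": {"index": {"knn": True}},  # enable k-NN on this index
        "mappings": {
            "properties": {
                "text": {"type": "text"},  # the chunk text
                vector_field: {
                    "type": "knn_vector",
                    "dimension": dims,
                    "method": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",  # cosine similarity
                        "engine": "lucene",  # assumed engine
                    },
                },
            }
        },
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_index_body(), indent=2))
```

The `dimension` value has to match the length of the vectors produced at both indexing and query time, otherwise OpenSearch rejects the documents or queries.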
┌─────────────────────────┐
│ ralph.sh (loop driver)  │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│ Pick next failing story │
│ from prd.json           │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│ Implement changes in    │
│ agent/ and/or index/    │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│ Lint, typecheck, test   │
└────────────┬────────────┘
             │
┌────────────▼────────────┐
│ Run Arize experiment    │
│ (recall@1, @5, @10)     │
└────────────┬────────────┘
             │
    ┌────────▼────────┐
    │ Recall@5 > 80%? │
    └───┬─────────┬───┘
     no │         │ yes
┌───────▼───┐ ┌───▼─────────┐
│ Add new   │ │ Done        │
│ stories   │ └─────────────┘
└─────┬─────┘
      │
      └──── (next iteration)
Agent (agent/.env):

| Variable | Required | Description |
|---|---|---|
| OPENAI_API_KEY | Yes | OpenAI API key |
| HOST | Yes | OpenSearch host |
| USERNAME | Yes | OpenSearch username |
| PASSWORD | Yes | OpenSearch password |
| INDEX | Yes | OpenSearch index name |
| ARIZE_SPACE_ID | No | Arize space ID |
| ARIZE_API_KEY | No | Arize API key |
| ARIZE_PROJECT_NAME | No | Arize project name |

Index pipeline (index/.env):

| Variable | Required | Description |
|---|---|---|
| OPENSEARCH_HOST | Yes | OpenSearch host URL |
| OPENSEARCH_USER | Yes | OpenSearch username |
| OPENSEARCH_PASS | Yes | OpenSearch password |
| OPENAI_API_KEY | Yes | OpenAI API key for embeddings |

# Agent
cd agent
make test # unit tests
make lint # ruff + mypy --strict
make format # auto-fix
# Index
cd index
uv sync --dev

Both Python packages use Ruff (pycodestyle, pyflakes, isort, pydocstyle) and mypy --strict.