All notable changes to this project are documented here. Format follows Keep a Changelog.
- Models: Complete Pydantic v2 models (`TaskId`, `Action`, `Scenario`, `EpisodeResult`, etc.)
- Scenarios: 30 synthetic PR scenarios (10 per task) with realistic Python diffs
- Env: Full episode state machine with noise budget, reward calculation, and history tracking
- Graders:
  - `bug_grader.py`: Coverage + precision + severity-weighted scoring
  - `security_grader.py`: Severity-accuracy-weighted scoring (CRITICAL misclassification penalized)
  - `arch_grader.py`: Binary issue detection + verdict scoring + detail quality bonus
- Config: Pydantic-settings config with all options documented in `.env.example`
- Database: SQLModel persistence (`EpisodeRecord`, `LeaderboardRecord`, helpers)
- API Endpoints:
  - `GET /stats`: Aggregate metrics across all recorded episodes
  - `GET /episodes/{id}/replay`: Full action-by-action replay for completed episodes
  - `GET /episodes`: List active episodes with metadata
  - `GET /dashboard`: Web dashboard (dark theme, live leaderboard, WebSocket event feed, stats cards)
- Security:
  - Rate limiting via `slowapi`: 60 req/min per IP (configurable)
  - API key authentication: optional, off by default, enabled via `API_KEY_ENABLED=true`
  - Added `TrustedHostMiddleware` and security headers (XSS, frame protection)
- Episode Lifecycle: Auto-cleanup of expired episodes every 5 minutes (default 1hr)
- Leaderboard: Paginated `/leaderboard?limit=N&offset=M&task_id=X`
- Baseline Agent: Full rewrite with argparse CLI, `KeywordAgent` (35 rules), `LLMAgent` (Claude)
- Evaluation: `scripts/evaluate.py` for batch evaluation of all 30 scenarios with summary report and progress bars
- Testing: 155+ parametrized tests with full coverage reporting.
- Dockerization: Multi-stage `builder` + `production` builds with non-root user security.
- CI/CD: Unified 5-job pipeline (`lint`, `test`, `validate`, `docker-build`, `publish` to GHCR).
- Branding: Full rebrand to CodeLens., including signature iconography.
- CLI: Fixed port mismatch in `baseline.py` (8000 → 7860) and added `--url`, `--task`, `--seed` CLI flags.
- Crash Fixes: Fixed leaderboard submit crash after list slicing (rank is now captured before the slice).
- WebSocket: Disconnect now handled with typed `WebSocketDisconnect` and `clients.discard()`.
- Metadata: Incoherent weight structure in `openenv.yaml` replaced with named, accurate pairs.
- Security: Implemented `TrustedHostMiddleware` and hardened headers.
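
The leaderboard crash fix above comes down to ordering: the rank has to be read while the new entry is still guaranteed to be in the list. The sketch below assumes the leaderboard is a list truncated to a fixed size after each submission, and that the crash came from looking up an entry that had already been sliced off. `Entry`, `submit`, and `MAX_ENTRIES` are hypothetical names, not the project's own.

```python
from dataclasses import dataclass

MAX_ENTRIES = 3  # hypothetical cap, kept small so the example is easy to follow


@dataclass
class Entry:
    """Hypothetical stand-in for a leaderboard row."""
    agent: str
    score: float


def submit(board, entry, max_entries=MAX_ENTRIES):
    """Insert `entry`, return its 1-based rank, then truncate the board."""
    board.append(entry)
    board.sort(key=lambda e: e.score, reverse=True)
    # Capture the rank BEFORE slicing: after truncation the entry may be gone,
    # and searching for it would fail.
    rank = next(i for i, e in enumerate(board) if e is entry) + 1
    del board[max_entries:]
    return rank


if __name__ == "__main__":
    board = [Entry("a", 0.9), Entry("b", 0.8), Entry("c", 0.7)]
    print(submit(board, Entry("d", 0.1)))  # entry falls outside the cap
    print([e.agent for e in board])
```

The entry is located by identity (`is`) rather than equality, so two submissions with the same agent and score do not shadow each other.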
- Initial FastAPI skeleton.
- In-memory episode storage.
- Basic Dockerfile and Pylint-only CI.