Changelog

All notable changes to this project are documented here. Format follows Keep a Changelog.

[1.0.0] - 2026-04-05

Added

  • Models: Complete Pydantic v2 models (TaskId, Action, Scenario, EpisodeResult, etc.)
  • Scenarios: 30 synthetic PR scenarios (10 per task) with realistic Python diffs
  • Env: Full episode state machine with noise budget, reward calculation, and history tracking
  • Graders:
    • bug_grader.py: Coverage + precision + severity-weighted scoring
    • security_grader.py: Severity-accuracy-weighted scoring (CRITICAL misclassification penalized)
    • arch_grader.py: Binary issue detection + verdict scoring + detail quality bonus
  • Config: Pydantic-settings config with all options documented in .env.example
  • Database: SQLModel persistence (EpisodeRecord, LeaderboardRecord, helpers)
  • API Endpoints:
    • GET /stats: Aggregate metrics across all recorded episodes
    • GET /episodes/{id}/replay: Full action-by-action replay for completed episodes
    • GET /episodes: List active episodes with metadata
    • GET /dashboard: Web dashboard (dark theme, live leaderboard, WebSocket event feed, stats cards)
  • Security:
    • Rate limiting via slowapi: 60 req/min per IP (configurable)
    • API key authentication: optional, off by default, enabled via API_KEY_ENABLED=true
    • TrustedHostMiddleware and security headers (XSS and frame protection)
  • Episode Lifecycle: Auto-cleanup of expired episodes runs every 5 minutes (default episode lifetime: 1hr)
  • Leaderboard: Paginated /leaderboard?limit=N&offset=M&task_id=X
  • Baseline Agent: Full rewrite with argparse CLI, KeywordAgent (35 rules), LLMAgent (Claude)
  • Evaluation: scripts/evaluate.py for batch evaluation of all 30 scenarios with summary report and progress bars
  • Testing: 155+ parametrized tests with full coverage reporting.
  • Dockerization: Multi-stage builder + production builds with non-root user security.
  • CI/CD: Unified 5-job pipeline (lint, test, validate, docker-build, publish to GHCR).
  • Branding: Full rebrand to CodeLens., including signature iconography.
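
For readers unfamiliar with the limit/offset/task_id query parameters on /leaderboard, the following is a minimal sketch of how such pagination typically behaves. The Entry class, the paginate helper, and the task_id values are hypothetical and are not taken from the project; the real LeaderboardRecord fields and sort order may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    # Hypothetical leaderboard row; the real LeaderboardRecord may differ.
    agent: str
    task_id: str
    score: float


def paginate(entries: List[Entry], limit: int = 10, offset: int = 0,
             task_id: Optional[str] = None) -> List[Entry]:
    """Filter by task, sort by score descending, then apply offset and limit."""
    if limit < 0 or offset < 0:
        raise ValueError("limit and offset must be non-negative")
    rows = [e for e in entries if task_id is None or e.task_id == task_id]
    rows.sort(key=lambda e: e.score, reverse=True)
    return rows[offset:offset + limit]


# Assumed sample data, roughly what ?limit=2&offset=1&task_id=bug would select.
board = [
    Entry("a", "bug", 0.9),
    Entry("b", "bug", 0.7),
    Entry("c", "security", 0.8),
    Entry("d", "bug", 0.5),
]
page = paginate(board, limit=2, offset=1, task_id="bug")
print([e.agent for e in page])  # → ['b', 'd']
```

Filtering before slicing keeps the offset relative to the filtered task rather than to the whole board, which is the usual expectation for a per-task leaderboard view.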

Fixed

  • CLI: Fixed port mismatch in baseline.py (8000 → 7860); added --url, --task, and --seed flags.
  • Crash Fixes: Leaderboard submit no longer crashes after list slicing (the rank is now captured before the slice).
  • WebSocket: Disconnect now handled with typed WebSocketDisconnect and clients.discard().
  • Metadata: Incoherent weight structure in openenv.yaml replaced with named, accurate pairs.
  • Security: Implemented TrustedHostMiddleware and hardened headers.
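
The leaderboard crash fix above follows a common pattern: look up an entry's rank before truncating the list, since the entry may no longer be present afterwards. The changelog does not describe the original failure in detail, so the following is only an illustrative sketch; the submit function, the dict layout, and max_size are assumptions rather than the project's code.

```python
from typing import List, Optional


def submit(leaderboard: List[dict], entry: dict, max_size: int = 3) -> Optional[int]:
    """Insert entry, keep the top max_size rows, and return the entry's
    1-based rank, or None if it did not make the cut."""
    leaderboard.append(entry)
    leaderboard.sort(key=lambda e: e["score"], reverse=True)
    # Capture the rank BEFORE truncating. After the slice the entry may be
    # gone, and leaderboard.index(entry) would raise ValueError.
    rank = leaderboard.index(entry) + 1
    del leaderboard[max_size:]
    return rank if rank <= max_size else None


# Assumed sample data for illustration.
board = [
    {"agent": "a", "score": 0.9},
    {"agent": "b", "score": 0.8},
    {"agent": "c", "score": 0.7},
]
low = submit(board, {"agent": "d", "score": 0.1})
high = submit(board, {"agent": "e", "score": 0.95})
print(low, high)  # → None 1
```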

[0.1.0] - Initial Baseline Fork

  • Initial FastAPI skeleton.
  • In-memory episode storage.
  • Basic Dockerfile and Pylint-only CI.