Meta-harness optimization loop wired onto Islo sandboxes. POC: 0/5→5/5 in four proposer steps. Built on islo.dev.
A debugging loop for AI agents. See what your agent checked, what it skipped, what evidence it used, and whether each action stayed inside the right permissions.
Live, open-source benchmark for comparing AI coding agents on real GitHub issues
Lynn: open-source desktop/CLI AI Agent with GUI Session Map, headless workers, realtime voice, long-term memory, Brain V2 routing, local 27B/35B GGUF support, and contract-based Agent Regression Kit.
AI-operated company. Building agent-friend: universal tool adapter for AI agents. @tool → OpenAI, Claude, Gemini, MCP. Live 24/7 on Twitch.
Project page for Meta-harness on Islo (POC). https://zozo123.github.io/meta-harness-on-islo-page/
A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-judge, tier-based leaderboard.
Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks.
Scenario Testing for AI Agents
Transcript-first evaluation tool for comparing coding-agent sessions across Codex, Claude Code, and Pi.
Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).
Auto-generate evaluation rubrics from agent audit-log trajectories (PhoneWorld pattern applied to action logs)
96K param RWKV-7 that detects non-termination (the Gödel sentence analog) zero-shot across SKI combinatory logic, lambda calculus, and Turing machines
🚀 An open-source, Java-based framework for automated AI evaluation.
A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents.
Local-first synthetic finance anomaly trainer for agent evals.
Regression testing for AI agents — snapshot tool-calls, diff in CI, fail on regressions. A GitHub Action. By Viprasol Tech.
Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.
Open-source evaluation of general-purpose AI agents on real-world tasks: identical prompts, objectively verifiable outcomes, and fully public scoring rubrics. By PingWest / 硅星人.
agent-convergence-scorer is a CLI and Python library that scores how lexically similar N agent or LLM outputs are: exact-match rate, Jaccard token overlap, divergence point, and a composite convergence score over any list of runs. An eval primitive for measuring reproducibility and fan-out collapse. Lexical, not semantic. Zero deps.
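To make the lexical metrics named in the agent-convergence-scorer description concrete, here is a minimal Python sketch. It is not the library's API: the function names, the whitespace tokenization, the pairwise averaging, and the definition of the divergence point as the first token index where the runs disagree are all assumptions made for illustration. The composite score is left out because the description does not say how it is computed.

```python
from itertools import combinations


def _tokens(text):
    # Assumption: plain whitespace tokenization.
    return text.split()


def _pairs(runs):
    # All unordered pairs of runs; needs at least two runs to compare.
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    return list(combinations(runs, 2))


def exact_match_rate(runs):
    # Fraction of run pairs whose text is identical.
    pairs = _pairs(runs)
    return sum(a == b for a, b in pairs) / len(pairs)


def jaccard(a, b):
    # Jaccard overlap of the two token sets; two empty outputs count as identical.
    sa, sb = set(_tokens(a)), set(_tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def mean_jaccard(runs):
    # Average Jaccard overlap over all run pairs.
    pairs = _pairs(runs)
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def divergence_point(runs):
    # Index of the first token position where the runs stop agreeing,
    # or None if every run has the same token sequence.
    seqs = [_tokens(r) for r in runs]
    shortest = min(len(s) for s in seqs)
    for i in range(shortest):
        if len({s[i] for s in seqs}) > 1:
            return i
    if any(len(s) != shortest for s in seqs):
        return shortest  # one run continues after another has ended
    return None


runs = ["the cat sat", "the cat sat", "the dog sat"]
print(exact_match_rate(runs), mean_jaccard(runs), divergence_point(runs))
```

Because every measure here compares tokens rather than meaning, two paraphrased but equivalent answers would score as divergent, which matches the "lexical, not semantic" caveat in the description.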