agent-eval

Here are 23 public repositories matching this topic...

zozo123 / meta-harness-on-islo

Meta-harness optimization loop wired onto Islo sandboxes. POC: 0/5→5/5 in four proposer steps. Built on islo.dev.

harbor llm-agents agent-eval meta-harness islo harness-optimization

Updated May 5, 2026
HTML

tenurehq / GroundEval

A debugging loop for AI agents. See what your agent checked, what it skipped, what evidence it used, and whether each action stayed inside the right permissions.

benchmarking evaluation-framework ai-agents test-agents llm-evaluation llm-as-judge llm-as-a-judge agent-evals agent-evaluation agent-harness agent-loop agent-eval debug-agents

Updated Jun 29, 2026
Python

linny006 / agent-eval-harness

Live, open-source benchmark for comparing AI coding agents on real GitHub issues

Updated Jun 29, 2026
Python

LynnMerkyor / Lynn

Lynn: open-source desktop/CLI AI Agent with GUI Session Map, headless workers, realtime voice, long-term memory, Brain V2 routing, local 27B/35B GGUF support, and contract-based Agent Regression Kit.

electron benchmark typescript regression-testing huggingface ai-agent realtime-voice modelscope desktop-agent openai-compatible cli-agent agent-runtime coding-benchmark agent-eval agent-regression

Updated Jun 28, 2026
TypeScript

0-co / company

AI-operated company. Building agent-friend: universal tool adapter for AI agents. @tool → OpenAI, Claude, Gemini, MCP. Live 24/7 on Twitch.

python twitch structured-logging interactive-cli exponential-backoff human-in-the-loop zero-dependencies open-startup ai-agent autonomous-ai building-in-public llm-tools agent-security mcp-security personal-ai-agent agent-eval agent-friend

Updated Mar 26, 2026
Python

zozo123 / meta-harness-on-islo-page

Project page for Meta-harness on Islo (POC). https://zozo123.github.io/meta-harness-on-islo-page/

project-page agent-eval meta-harness islo

Updated May 5, 2026
JavaScript

arthursoares / openclaw-llm-bench

A reasoning benchmark runner for comparing LLMs as OpenClaw agents use them. 52 prompts, 3 eval sets, 11 traps, LLM-as-judge, tier-based leaderboard.

gpt reasoning claude llm-eval ollama llm-as-judge llm-benchmark openclaw agent-eval

Updated Apr 11, 2026
Python

tushariitr-19 / assay

Framework-agnostic evaluation harness for Go — test your MCP servers and AI agents with scored, CI-ready checks.

testing cli golang mcp evaluation adk ai-agents llm-eval model-context-protocol agent-eval

Updated Jun 17, 2026
Go

gojiplus / understudy

Scenario Testing for AI Agents

simulation evaluation agentic agent-evaluation google-adk agent-eval

Updated Jun 23, 2026
Python

fitchmultz / agent-eval

Transcript-first evaluation tool for comparing coding-agent sessions across Codex, Claude Code, and Pi.

typescript evaluation pi transcripts codex coding-agents claude-code agent-eval

Updated May 29, 2026
TypeScript

jeremylongshore / intent-eval-lab

Vendor-neutral research umbrella for measuring AI plugin, agent, and MCP server quality across CLI runtimes (Claude Code, Gemini CLI, Copilot CLI, Codex CLI).

mcp skill-discovery opentelemetry ai-evaluation gemini-cli claude-code plugin-testing cross-cli agent-eval invocation-rate

Updated Jun 29, 2026
Python

workloftai / auto-rubrics

Auto-generate evaluation rubrics from agent audit-log trajectories (PhoneWorld pattern applied to action logs)

evaluation rubrics llm llm-as-judge agent-eval

Updated Jun 23, 2026
Python

hamzaplojovic / godel-rwkv

96K param RWKV-7 that detects non-termination (the Gödel sentence analog) zero-shot across SKI combinatory logic, lambda calculus, and Turing machines

machine-learning mlx llm rwkv claude-code swe-bench agent-eval

Updated Jun 15, 2026
Python

zendodx / evalkit-framework

🚀 An open-source, Java-based framework for automated AI evaluation

java framework java-8 eval eval-framework ai-eval agent-eval

Updated Jun 29, 2026
Java

ttxs69 / awesome-coding-agent-eval

A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents.

benchmark leaderboard evaluation awesome-list codex ai-agent llm aider claude-code coding-agent swe-bench agent-eval ai-coding-agent-benchmark coding-agent-benchmark

Updated Jun 8, 2026

rogerchappel / ledgerpet

Local-first synthetic finance anomaly trainer for agent evals.

cli synthetic-data local-first agent-eval finance-ops

Updated Jun 19, 2026
JavaScript

Viprasol-Tech / agentcheck

Regression testing for AI agents — snapshot tool-calls, diff in CI, fail on regressions. A GitHub Action. By Viprasol Tech.

testing typescript ci snapshot-testing regression-testing ai-agents github-action llm llmops agent-eval

Updated Jun 7, 2026
TypeScript
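
The snapshot-and-diff idea in the agentcheck description can be sketched roughly as follows. This is an illustrative Python sketch, not agentcheck's actual implementation or API: the function name `diff_tool_calls` and the `{"tool": ..., "args": ...}` record shape are assumptions made for the example.

```python
import json
from typing import Any, Dict, List, Optional

ToolCall = Dict[str, Any]  # hypothetical record shape: {"tool": str, "args": dict}


def diff_tool_calls(snapshot: List[ToolCall], current: List[ToolCall]) -> List[str]:
    """Compare two tool-call sequences step by step and describe each mismatch."""
    diffs: List[str] = []
    for i in range(max(len(snapshot), len(current))):
        old: Optional[ToolCall] = snapshot[i] if i < len(snapshot) else None
        new: Optional[ToolCall] = current[i] if i < len(current) else None
        if old != new:
            diffs.append(
                f"step {i}: expected {json.dumps(old, sort_keys=True)}, "
                f"got {json.dumps(new, sort_keys=True)}"
            )
    return diffs


# Hypothetical recorded baseline and a later run of the same agent.
baseline = [
    {"tool": "search", "args": {"q": "open issues"}},
    {"tool": "read_file", "args": {"path": "README.md"}},
]
latest = [
    {"tool": "search", "args": {"q": "open issues"}},
    {"tool": "write_file", "args": {"path": "README.md"}},
]

regressions = diff_tool_calls(baseline, latest)
```

In a CI job, a non-empty `regressions` list would be the signal to exit with a non-zero status so the build fails.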

mizcausevic-dev / agent-eval-arena

Agent and LLM evaluation harness — golden datasets, multi-scorer execution, regression detection across model versions, cost-quality leaderboards, and CI gates for model promotion.

express typescript platform-engineering regression-detection ml-ops ai-platform ai-governance llm-eval agent-eval ci-gate

Updated Jun 22, 2026
TypeScript

pingwest-ai / agent-eval

Open-source evaluation of general-purpose AI agents on real-world tasks with verifiable outcomes: identical prompts, objectively resolved results, and fully public scoring rubrics. By PingWest (硅星人).

benchmark evaluation ai-agents llm llm-evaluation deep-research agent-eval

Updated Jun 13, 2026

hermes-labs-ai / agent-convergence-scorer

agent-convergence-scorer is a CLI and Python library that scores how lexically similar N agent or LLM outputs are: exact-match rate, Jaccard token overlap, divergence point, and a composite convergence score over any list of runs. An eval primitive for measuring reproducibility and fan-out collapse. Lexical, not semantic. Zero deps.

cli benchmark consistency evaluation similarity multi-agent convergence reproducibility agents jaccard divergence llm llm-evaluation ai-reliability eval-harness agent-eval

Updated Jun 7, 2026
Python
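
The lexical metrics named in the agent-convergence-scorer description can be illustrated with a short sketch. It is written from the description alone and is not the library's code: the function names, the lowercase whitespace tokenization, and the pairwise averaging are all assumptions. The composite convergence score is omitted because the description does not give its formula.

```python
from itertools import combinations
from typing import List, Optional, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenization (an assumption; the real tool may differ)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the two token sets: |A ∩ B| / |A ∪ B|."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 1.0  # two empty outputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def exact_match_rate(outputs: List[str]) -> float:
    """Fraction of output pairs that are string-identical."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_jaccard(outputs: List[str]) -> float:
    """Average Jaccard overlap across all output pairs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def divergence_point(a: str, b: str) -> Optional[int]:
    """Index of the first token where two outputs differ, or None if they never do."""
    ta, tb = a.split(), b.split()
    for i, (x, y) in enumerate(zip(ta, tb)):
        if x != y:
            return i
    # One output is a prefix of the other: they diverge where the shorter one ends.
    return None if len(ta) == len(tb) else min(len(ta), len(tb))


runs = ["the cat sat", "the cat sat", "the dog sat"]
```

For `runs`, one of the three pairs matches exactly, and the two mismatched pairs share two of four distinct tokens, which shows why these scores are lexical rather than semantic: a paraphrase with different words would score low even if its meaning were the same.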
