
feat(conversation): add conversation package, per-turn metrics, and runner #2

Open
sunilgattupalle wants to merge 11 commits into main from feat/multi-turn-gaps

Conversation


@sunilgattupalle (Collaborator) commented on May 9, 2026

Summary

Closes the remaining multi-turn evaluation gaps identified against DeepEval's feature set.

Changes

Message operational fields (src/harness_evals/core/types.py)

  • Added optional latency_ms: float, token_count: int, cost_usd: float fields to carry per-turn observability data through the message history
  • Fully backward-compatible — all fields default to None; to_dict()/from_dict() handle missing keys (see the sketch below)
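
For reviewers, a minimal sketch of the intended Message shape. The field names and the None-omitting / backward-compatible serialization behavior come from this PR; the role/content core and the exact implementation here are assumptions:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Message:
    role: str
    content: str
    # New optional per-turn observability fields, all defaulting to None.
    latency_ms: float | None = None
    token_count: int | None = None
    cost_usd: float | None = None

    def to_dict(self) -> dict[str, Any]:
        # None-valued operational fields are omitted from the payload.
        data: dict[str, Any] = {"role": self.role, "content": self.content}
        for key in ("latency_ms", "token_count", "cost_usd"):
            value = getattr(self, key)
            if value is not None:
                data[key] = value
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Message":
        # Missing keys fall back to the None defaults, so payloads written
        # before this change still deserialize.
        return cls(
            role=data["role"],
            content=data["content"],
            latency_ms=data.get("latency_ms"),
            token_count=data.get("token_count"),
            cost_usd=data.get("cost_usd"),
        )
```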

TurnLatencyMetric / TurnTokenCostMetric (src/harness_evals/metrics/operational/)

  • Score per-turn latency and token usage from Message.latency_ms / Message.token_count on assistant turns
  • Scoring: max(0, 1 - value / budget) per turn, mean-aggregated; negative values skipped (see the sketch below)
  • Metadata includes turn_latencies/turn_token_counts, mean_latency_ms/mean_token_count, n_turns_scored
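
A self-contained sketch of that scoring rule; the helper name and signature are illustrative, not the metrics' actual internals:

```python
def turn_budget_score(values: list[float | None], budget: float) -> float | None:
    # Per-turn score: max(0, 1 - value / budget); missing or negative
    # values are skipped. The aggregate is the mean of the scored turns.
    scores = [max(0.0, 1.0 - v / budget) for v in values if v is not None and v >= 0]
    return sum(scores) / len(scores) if scores else None

# 500 ms and 1500 ms against a 1000 ms budget score 0.5 and 0.0 -> mean 0.25.
assert turn_budget_score([500.0, None, 1500.0], budget=1000.0) == 0.25
```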

ConversationSynthesizer / ScriptedConversationSynthesizer (src/harness_evals/synthesizer/conversation.py)

  • Two new synthesizers generating ConversationGolden datasets from source documents
  • ConversationSynthesizer (task_type="conversation") — produces scenario + expected_outcome + user_persona (no turns), intended for use with ConversationSimulator
  • ScriptedConversationSynthesizer (task_type="conversation_scripted") — produces fully scripted turns: list[Message] for direct replay
  • Both registered in the Synthesizer facade; deduplication by scenario name (output shapes sketched below)
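
To make the difference concrete, a hedged sketch of the golden each synthesizer emits. The field names (scenario, expected_outcome, user_persona, turns) are from this PR; the dataclass layout and example values are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str
    content: str

@dataclass
class ConversationGolden:
    scenario: str
    expected_outcome: str
    user_persona: str | None = None
    turns: list[Message] = field(default_factory=list)

# ConversationSynthesizer output: no turns; the ConversationSimulator
# generates them at evaluation time.
simulated = ConversationGolden(
    scenario="User wants to cancel yesterday's order",
    expected_outcome="Agent confirms the cancellation and refund timeline",
    user_persona="terse, slightly impatient customer",
)

# ScriptedConversationSynthesizer output: fully scripted turns for replay.
scripted = ConversationGolden(
    scenario="Password reset walkthrough",
    expected_outcome="User receives a working reset link",
    turns=[
        Message("user", "I can't log in to my account."),
        Message("assistant", "Happy to help. Should I send a password reset link?"),
    ],
)
```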

Unified evaluate_dataset() (src/harness_evals/core/runner.py)

  • Extended to accept list[Golden] | list[ConversationGolden] with a new simulator_llm kwarg
  • ConversationGolden inputs route through ConversationSimulator.simulate_batch(); pre-scripted turns bypass simulation (replay mode)
  • Mixed lists raise TypeError; missing simulator_llm for conversation inputs raises ValueError
  • Existing single-turn callers are unaffected (routing rules sketched below)
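
The dispatch rules above, condensed into a sketch; this helper is hypothetical, the real logic lives inside evaluate_dataset():

```python
def _route(goldens: list, simulator_llm=None) -> str:
    # Hypothetical helper mirroring the routing described above.
    kinds = {type(g).__name__ for g in goldens}
    if kinds == {"Golden"}:
        return "single_turn"              # existing path, unchanged
    if kinds == {"ConversationGolden"}:
        if all(g.turns for g in goldens):
            return "replay"               # pre-scripted turns bypass simulation
        if simulator_llm is None:
            raise ValueError("simulator_llm is required for conversation goldens")
        return "simulate"                 # ConversationSimulator.simulate_batch()
    raise TypeError(f"mixed golden types not supported: {sorted(kinds)}")
```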

ConversationalGEvalMetric (src/harness_evals/metrics/conversation/conversational_geval.py)

  • LLM judge that scores each assistant turn individually against a user-defined criterion
  • Supports MultiTurnView.FULL_CONVERSATION (all messages) and MultiTurnView.SLIDING_WINDOW (last N turns)
  • Returns mean score with per-turn breakdown in metadata["turn_scores"] (view windowing sketched below)
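
A sketch of the two view modes; only the MultiTurnView member names come from this PR, while the helper, its message representation, and the window parameter are illustrative:

```python
from enum import Enum

class MultiTurnView(Enum):
    FULL_CONVERSATION = "full_conversation"
    SLIDING_WINDOW = "sliding_window"

def judge_context(messages: list[dict], turn_idx: int,
                  view: MultiTurnView, window: int = 5) -> list[dict]:
    # Context handed to the judge when scoring the assistant turn at turn_idx.
    if view is MultiTurnView.FULL_CONVERSATION:
        return messages[: turn_idx + 1]                # everything up to this turn
    return messages[max(0, turn_idx + 1 - window) : turn_idx + 1]  # last N turns
```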

Per-turn scoring backfill (coherence.py, turn_relevancy.py, knowledge_retention.py)

  • Opted into _per_turn=True on LLMConversationMetric — these metrics now also store metadata["turn_scores"] alongside the aggregate score (see the sketch below)
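
The expected result shape after the backfill, sketched; the dict layout is assumed for illustration:

```python
# With _per_turn=True, the aggregate score is unchanged and the per-turn
# values are surfaced alongside it.
turn_scores = [0.9, 0.6, 1.0]
result = {
    "score": sum(turn_scores) / len(turn_scores),   # aggregate, as before
    "metadata": {"turn_scores": turn_scores},       # new per-turn breakdown
}
assert result["metadata"]["turn_scores"][1] == 0.6
```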

Test plan

  • 907 tests pass locally (Python 3.11), 4 skipped
  • ruff check + ruff format --check clean across all 220 files
  • Public API verified: ConversationSynthesizer, ScriptedConversationSynthesizer, TurnLatencyMetric, TurnTokenCostMetric, ConversationalGEvalMetric, evaluate_dataset(simulator_llm=...)
  • CI matrix: Python 3.10 / 3.11 / 3.12

🤖 Generated with Claude Code

Add latency_ms, token_count, and cost_usd optional fields to the Message
dataclass with None-omitting to_dict serialization and backward-compatible
from_dict deserialization. Five new tests verify default values, setting,
serialization, roundtrip, and backward compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AI-Session-Id: 4fda9174-9877-429d-9b01-af7a3dad6d46
AI-Tool: claude-code
AI-Model: unknown
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@sunilgattupalle changed the title from "multi-turn evals implemantation" to "feat(conversation): add conversation package, per-turn metrics, and runner" on May 9, 2026
