
feat(conversation): add conversation package, per-turn metrics, and runner #2

Open
sunilgattupalle wants to merge 11 commits into main from feat/multi-turn-gaps

Conversation


@sunilgattupalle (Collaborator) commented on May 9, 2026

Summary

Closes the remaining multi-turn evaluation gaps identified against DeepEval's feature set.

Changes

Message operational fields (src/harness_evals/core/types.py)

  • Added optional latency_ms: float, token_count: int, cost_usd: float fields to carry per-turn observability data through the message history
  • Fully backward-compatible — all fields default to None; to_dict()/from_dict() handle missing keys (see the sketch below)
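
For reviewers, a minimal sketch of the intended Message shape. The field names and the None-omitting / backward-compatible serialization behavior come from this PR; the role/content core and the exact implementation here are assumptions:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Message:
    role: str
    content: str
    # New optional per-turn observability fields, all defaulting to None.
    latency_ms: float | None = None
    token_count: int | None = None
    cost_usd: float | None = None

    def to_dict(self) -> dict[str, Any]:
        # None-valued operational fields are omitted from the payload.
        data: dict[str, Any] = {"role": self.role, "content": self.content}
        for key in ("latency_ms", "token_count", "cost_usd"):
            value = getattr(self, key)
            if value is not None:
                data[key] = value
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Message":
        # Missing keys fall back to the None defaults, so payloads written
        # before this change still deserialize.
        return cls(
            role=data["role"],
            content=data["content"],
            latency_ms=data.get("latency_ms"),
            token_count=data.get("token_count"),
            cost_usd=data.get("cost_usd"),
        )
```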

TurnLatencyMetric / TurnTokenCostMetric (src/harness_evals/metrics/operational/)

  • Score per-turn latency and token usage from Message.latency_ms / Message.token_count on assistant turns
  • Scoring: max(0, 1 - value / budget) per turn, mean-aggregated; negative values skipped (see the sketch below)
  • Metadata includes turn_latencies/turn_token_counts, mean_latency_ms/mean_token_count, n_turns_scored
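
A self-contained sketch of that scoring rule; the helper name and signature are illustrative, not the metrics' actual internals:

```python
def turn_budget_score(values: list[float | None], budget: float) -> float | None:
    # Per-turn score: max(0, 1 - value / budget); missing or negative
    # values are skipped. The aggregate is the mean of the scored turns.
    scores = [max(0.0, 1.0 - v / budget) for v in values if v is not None and v >= 0]
    return sum(scores) / len(scores) if scores else None

# 500 ms and 1500 ms against a 1000 ms budget score 0.5 and 0.0 -> mean 0.25.
assert turn_budget_score([500.0, None, 1500.0], budget=1000.0) == 0.25
```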

ConversationSynthesizer / ScriptedConversationSynthesizer (src/harness_evals/synthesizer/conversation.py)

  • Two new synthesizers generating ConversationGolden datasets from source documents
  • ConversationSynthesizer (task_type="conversation") — produces scenario + expected_outcome + user_persona (no turns), intended for use with ConversationSimulator
  • ScriptedConversationSynthesizer (task_type="conversation_scripted") — produces fully scripted turns: list[Message] for direct replay
  • Both registered in the Synthesizer facade; deduplication by scenario name (output shapes sketched below)
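
To make the difference concrete, a hedged sketch of the golden each synthesizer emits. The field names (scenario, expected_outcome, user_persona, turns) are from this PR; the dataclass layout and example values are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str
    content: str

@dataclass
class ConversationGolden:
    scenario: str
    expected_outcome: str
    user_persona: str | None = None
    turns: list[Message] = field(default_factory=list)

# ConversationSynthesizer output: no turns; the ConversationSimulator
# generates them at evaluation time.
simulated = ConversationGolden(
    scenario="User wants to cancel yesterday's order",
    expected_outcome="Agent confirms the cancellation and refund timeline",
    user_persona="terse, slightly impatient customer",
)

# ScriptedConversationSynthesizer output: fully scripted turns for replay.
scripted = ConversationGolden(
    scenario="Password reset walkthrough",
    expected_outcome="User receives a working reset link",
    turns=[
        Message("user", "I can't log in to my account."),
        Message("assistant", "Happy to help. Should I send a password reset link?"),
    ],
)
```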

Unified evaluate_dataset() (src/harness_evals/core/runner.py)

  • Extended to accept list[Golden] | list[ConversationGolden] with a new simulator_llm kwarg
  • ConversationGolden inputs route through ConversationSimulator.simulate_batch(); pre-scripted turns bypass simulation (replay mode)
  • Mixed lists raise TypeError; missing simulator_llm for conversation inputs raises ValueError
  • Existing single-turn callers are unaffected (routing rules sketched below)
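
The dispatch rules above, condensed into a sketch; this helper is hypothetical, the real logic lives inside evaluate_dataset():

```python
def _route(goldens: list, simulator_llm=None) -> str:
    # Hypothetical helper mirroring the routing described above.
    kinds = {type(g).__name__ for g in goldens}
    if kinds == {"Golden"}:
        return "single_turn"              # existing path, unchanged
    if kinds == {"ConversationGolden"}:
        if all(g.turns for g in goldens):
            return "replay"               # pre-scripted turns bypass simulation
        if simulator_llm is None:
            raise ValueError("simulator_llm is required for conversation goldens")
        return "simulate"                 # ConversationSimulator.simulate_batch()
    raise TypeError(f"mixed golden types not supported: {sorted(kinds)}")
```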

ConversationalGEvalMetric (src/harness_evals/metrics/conversation/conversational_geval.py)

  • LLM judge that scores each assistant turn individually against a user-defined criterion
  • Supports MultiTurnView.FULL_CONVERSATION (all messages) and MultiTurnView.SLIDING_WINDOW (last N turns)
  • Returns mean score with per-turn breakdown in metadata["turn_scores"] (view windowing sketched below)
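
A sketch of the two view modes; only the MultiTurnView member names come from this PR, while the helper, its message representation, and the window parameter are illustrative:

```python
from enum import Enum

class MultiTurnView(Enum):
    FULL_CONVERSATION = "full_conversation"
    SLIDING_WINDOW = "sliding_window"

def judge_context(messages: list[dict], turn_idx: int,
                  view: MultiTurnView, window: int = 5) -> list[dict]:
    # Context handed to the judge when scoring the assistant turn at turn_idx.
    if view is MultiTurnView.FULL_CONVERSATION:
        return messages[: turn_idx + 1]                # everything up to this turn
    return messages[max(0, turn_idx + 1 - window) : turn_idx + 1]  # last N turns
```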

Per-turn scoring backfill (coherence.py, turn_relevancy.py, knowledge_retention.py)

  • Opted into _per_turn=True on LLMConversationMetric — these metrics now also store metadata["turn_scores"] alongside the aggregate score (see the sketch below)
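
The expected result shape after the backfill, sketched; the dict layout is assumed for illustration:

```python
# With _per_turn=True, the aggregate score is unchanged and the per-turn
# values are surfaced alongside it.
turn_scores = [0.9, 0.6, 1.0]
result = {
    "score": sum(turn_scores) / len(turn_scores),   # aggregate, as before
    "metadata": {"turn_scores": turn_scores},       # new per-turn breakdown
}
assert result["metadata"]["turn_scores"][1] == 0.6
```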

Test plan

  • 907 tests pass locally (Python 3.11), 4 skipped
  • ruff check + ruff format --check clean across all 220 files
  • Public API verified: ConversationSynthesizer, ScriptedConversationSynthesizer, TurnLatencyMetric, TurnTokenCostMetric, ConversationalGEvalMetric, evaluate_dataset(simulator_llm=...)
  • CI matrix: Python 3.10 / 3.11 / 3.12

🤖 Generated with Claude Code

Add latency_ms, token_count, and cost_usd optional fields to the Message
dataclass with None-omitting to_dict serialization and backward-compatible
from_dict deserialization. Five new tests verify default values, setting,
serialization, roundtrip, and backward compatibility.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AI-Session-Id: 4fda9174-9877-429d-9b01-af7a3dad6d46
AI-Tool: claude-code
AI-Model: unknown
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@sunilgattupalle changed the title from "multi-turn evals implemantation" to "feat(conversation): add conversation package, per-turn metrics, and runner" on May 9, 2026
