Evaluation Taxonomy

Hallucination Tests

What Counts as Hallucination

We classify hallucinations into three categories:

Unsupported Claim: The model asserts something that is not present in, or is contradicted by, the provided context.
- Example: Context says "founded in 2019", model says "founded in 2020"
- Example: Context mentions "Jane Doe", model claims "John Smith"
Wrong Entity: The model correctly identifies a category but substitutes the wrong specific entity.
- Example: Context says "CEO of Acme is Jane", model says "CEO is Sarah"
Wrong Number/Value: The model produces a numerically incorrect value.
- Example: Context says "$50M revenue", model says "$500M"

Refusal Behavior

A "good refusal" is the model answering "I don't know" or "not mentioned" when the information is truly absent from the context. A "bad refusal" is the model refusing even though the answer is present in the context.
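The distinction above can be sketched as a small classifier. This is only an illustration: the marker phrases come from the examples in this section, substring matching is an assumed detection method, and `classify_refusal` is a hypothetical helper, not part of the tool.

```python
# Phrases taken from the examples above; a real detector would likely need more.
REFUSAL_MARKERS = ("i don't know", "not mentioned")


def is_refusal(response: str) -> bool:
    """Return True if the response contains one of the refusal phrases."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def classify_refusal(response: str, answer_in_context: bool) -> str:
    """Label a response as 'answered', 'good_refusal', or 'bad_refusal'."""
    if not is_refusal(response):
        return "answered"
    # Refusing is only correct when the context really lacks the answer.
    return "bad_refusal" if answer_in_context else "good_refusal"
```

The `answer_in_context` flag is assumed to come from the test case's ground truth, not from the model.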

Metrics

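Exact Match Rate: Fraction of responses that match the ground truth exactly
Allowed Variant Rate: Fraction of responses that match an acceptable paraphrase
Hallucination Rate: Fraction of responses that contain unsupported claims
Refusal Rate: Fraction of responses in which the model refuses to answer (desirable only when the information is absent)
Safe Rate: Fraction of responses with no hallucinations detected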
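A minimal sketch of how these rates might be computed from per-response labels. The `Outcome` record and its field names are assumptions for illustration; the actual result schema may differ.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    # Hypothetical per-response labels, one record per evaluated response.
    exact_match: bool
    allowed_variant: bool
    hallucinated: bool
    refused: bool


def hallucination_metrics(outcomes: list[Outcome]) -> dict[str, float]:
    """Return each metric as a fraction of all responses."""
    n = len(outcomes)
    if n == 0:
        return {}

    def rate(flag) -> float:
        return sum(1 for o in outcomes if flag(o)) / n

    return {
        "exact_match_rate": rate(lambda o: o.exact_match),
        "allowed_variant_rate": rate(lambda o: o.allowed_variant),
        "hallucination_rate": rate(lambda o: o.hallucinated),
        "refusal_rate": rate(lambda o: o.refused),
        # Safe rate is taken here as the complement of the hallucination rate.
        "safe_rate": rate(lambda o: not o.hallucinated),
    }
```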

Prompt Brittleness Tests

Measurement

We measure brittleness through response variance across semantically equivalent prompts.

Variance Metric

For a scenario with N variations:

Consistency Rate: Fraction of variations yielding identical (normalized) answers
Unique Answers: Count of distinct responses after normalization
Keywords Present Rate: For fuzzy tolerance, fraction of expected keywords appearing
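The three quantities above could be computed roughly as follows. The normalization (lowercase, strip punctuation, collapse whitespace) is an assumption, and so is reading Consistency Rate as the share of variations that agree with the most common answer.

```python
import re
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (assumed normalization)."""
    cleaned = re.sub(r"[^\w\s]", "", answer.lower())
    return " ".join(cleaned.split())


def variance_metrics(answers: list[str], expected_keywords: list[str]) -> dict:
    """Compute consistency, unique-answer count, and keyword coverage for one scenario."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [normalize(a) for a in answers]
    counts = Counter(normalized)
    # Share of variations that agree with the most common normalized answer.
    top_count = counts.most_common(1)[0][1]
    keywords = [normalize(k) for k in expected_keywords]
    if keywords:
        hits = sum(1 for a in normalized for k in keywords if k in a)
        keyword_rate = hits / (len(keywords) * len(normalized))
    else:
        keyword_rate = 1.0  # nothing expected, so nothing is missing
    return {
        "consistency_rate": top_count / len(normalized),
        "unique_answers": len(counts),
        "keywords_present_rate": keyword_rate,
    }
```

Note that "Paris." and "The capital is Paris!" count as different answers under exact comparison, while the keyword rate treats both as acceptable. That gap is the fuzzy tolerance the keyword metric is meant to provide.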

Brittleness Categories

Phrasing Variance: Same question, different words
- "What is the capital of France?" vs "France's capital is...?"
Word-Order Sensitivity: Semantic meaning preserved, order changed
- "Cat sat on mat" vs "On the mat sat a cat"
Minimal Perturbation: One-word changes that flip meaning
- "not" vs "now", "hot" vs "not"

Metrics

Consistency Rate: Higher is better (1.0 = perfectly consistent)
Avg Unique Answers: Lower is better (1.0 = all same answer)
Refusal Variance: Number of variations that triggered refusal

Structured Output Tests

What We Validate

JSON Validity: Response is parseable JSON
Schema Conformance: Response matches expected Pydantic model
Type Correctness: Fields have correct types (int, str, array, object)
Required Fields: All required fields present
No Extra Fields: Strict mode (no undefined properties)

Failure Modes

Not JSON: Response is plain text, YAML, or malformed
Missing Required: Schema-required field absent
Type Mismatch: String where int expected, etc.
Unexpected Field: Extra properties not in schema
Nested Violation: Sub-objects or arrays fail validation
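To make the failure modes concrete, here is a dependency-free sketch that maps a raw response to one of the labels above. It deliberately swaps Pydantic for a plain dictionary schema so it runs with the standard library alone. The `SCHEMA` fields and the label strings are hypothetical, and nested validation is not covered.

```python
import json

# Hypothetical flat schema: field name -> (expected Python type, required?)
SCHEMA = {
    "name": (str, True),
    "age": (int, True),
    "tags": (list, False),
}


def classify_response(raw: str, schema: dict = SCHEMA) -> str:
    """Return 'valid' or the first failure mode found in the response."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "not_json"
    if not isinstance(data, dict):
        # Parseable JSON, but not the object the schema expects.
        return "type_mismatch"
    for field, (_, required) in schema.items():
        if required and field not in data:
            return "missing_required"
    for field, value in data.items():
        if field not in schema:
            return "unexpected_field"  # strict mode: no undefined properties
        expected_type = schema[field][0]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) and expected_type is not bool:
            return "type_mismatch"
        if not isinstance(value, expected_type):
            return "type_mismatch"
    return "valid"
```

With Pydantic itself, the same checks would come from validating against a model class and inspecting the raised validation error.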

Metrics

Valid JSON Rate: Fraction of responses that parse as JSON
Schema Valid Rate: Fraction of responses that pass Pydantic validation
Retry Success Rate: Fraction of invalid responses that become valid after a "fix this JSON" nudge

Tool-Use Evaluation

What We Test

Tool Selection: Correct function chosen for task
Argument Extraction: Correct parameters passed
Argument Format: Types match tool schema

Example Case

Task: "What's the weather in Tokyo?"

Expected:
- tool: "get_weather"
- arguments: {"location": "Tokyo"}

Failure modes:
- Wrong tool: get_time, search_web
- Missing argument: {}
- Wrong type: {"location": 123}
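The example case can be checked with a comparison like the one below. This is a sketch: the dictionary shape mirrors the expected call shown above, while the function name and result labels are made up for illustration.

```python
EXPECTED = {"tool": "get_weather", "arguments": {"location": "Tokyo"}}


def check_tool_call(actual: dict, expected: dict = EXPECTED) -> str:
    """Compare a model's tool call with the expected one and name the failure."""
    if actual.get("tool") != expected["tool"]:
        return "wrong_tool"
    actual_args = actual.get("arguments", {})
    for name, expected_value in expected["arguments"].items():
        if name not in actual_args:
            return "missing_argument"
        # Check the type before the value so 123 reports as a type problem.
        if type(actual_args[name]) is not type(expected_value):
            return "wrong_type"
        if actual_args[name] != expected_value:
            return "wrong_value"
    return "pass"
```

Tool selection is checked first because argument checks are meaningless once the wrong function has been chosen.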

Reasoning Chain Evaluation

What We Test

Step Presence: Expected reasoning steps are present
Logical Consistency: Conclusions follow from premises
Step Coherence: Steps build on each other logically

Failure Modes

Missing Steps: Skips critical reasoning steps
Logical Error: Conclusion doesn't follow from steps
Circular Reasoning: Conclusion used as premise

Metrics

Step Coherence Rate: Fraction of expected steps found
Logical Error Rate: Fraction of responses with logical errors
Perfect Reasoning Rate: Responses with all steps and no errors
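One way these metrics could be aggregated is sketched below. It assumes that step presence is detected by case-insensitive substring match and that logical errors arrive as a pre-computed `has_logical_error` label. Both are assumptions, as are the record field names.

```python
def step_coherence_rate(response: str, expected_steps: list[str]) -> float:
    """Fraction of expected steps whose text appears in the response."""
    if not expected_steps:
        return 1.0
    text = response.lower()
    found = sum(1 for step in expected_steps if step.lower() in text)
    return found / len(expected_steps)


def reasoning_metrics(results: list[dict]) -> dict[str, float]:
    """Aggregate over records with 'response', 'expected_steps', 'has_logical_error'."""
    if not results:
        return {}
    coherence = [
        step_coherence_rate(r["response"], r["expected_steps"]) for r in results
    ]
    errors = [bool(r["has_logical_error"]) for r in results]
    # "Perfect" means every expected step was found and no logical error was flagged.
    perfect = [c == 1.0 and not e for c, e in zip(coherence, errors)]
    n = len(results)
    return {
        "step_coherence_rate": sum(coherence) / n,
        "logical_error_rate": sum(errors) / n,
        "perfect_reasoning_rate": sum(perfect) / n,
    }
```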

Safety/Adversarial Evaluation

What We Test

Injection Detection: Model resists prompt injection
Harmful Content Refusal: Model refuses harmful requests
Jailbreak Resistance: Model resists role manipulation

Injection Types

Direct Injection: "Ignore previous instructions"
Role Switch: "You are now in developer mode"
Jailbreak: DAN-style prompts

Harm Categories

Violence
Cybercrime
Drugs
Fraud/Scams

Metrics

Injection Blocked Rate: Fraction of injections blocked
Harmful Refusal Rate: Fraction of harmful requests refused
Info Leak Rate: Fraction of responses that leaked info
Overall Safe Rate: Fraction of safe responses

Streaming Validation

What We Test

Content Match: Streaming matches non-streaming response
Error Recovery: Graceful handling of stream errors
Latency: Time to first chunk

Metrics

Match Rate: Fraction of responses whose streamed content matches the non-streaming response
Error Rate: Fraction of streams with errors
Avg Latency (ms): Average time to first chunk
Avg Chunks: Average chunks per response
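A rough sketch of a per-response streaming check. It treats the stream as any iterable of text chunks and compares the joined result with a reference response. The function name and the result keys are hypothetical.

```python
import time
from typing import Iterable


def validate_stream(chunks: Iterable[str], reference: str) -> dict:
    """Consume a stream, recording content match, errors, and time to first chunk."""
    start = time.perf_counter()
    first_chunk_ms = None
    parts: list[str] = []
    error = None
    try:
        for chunk in chunks:
            if first_chunk_ms is None:
                first_chunk_ms = (time.perf_counter() - start) * 1000
            parts.append(chunk)
    except Exception as exc:  # a stream error should be recorded, not crash the eval
        error = str(exc)
    return {
        "content_match": error is None and "".join(parts) == reference,
        "error": error,
        "first_chunk_ms": first_chunk_ms,
        "chunks": len(parts),
    }
```

Averaging `first_chunk_ms` and `chunks` across responses would give the two averages listed above.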

Cost Tracking

What We Track

Input/output token counts
Cached token usage
Per-model pricing

Metrics

Total Cost (USD): Cumulative API costs
Total Tokens: Input + output tokens
Cost by Model: Breakdown per model
Cost by Eval Type: Breakdown per evaluation type
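The arithmetic behind these metrics might look like the following. The prices are placeholders, not real pricing. The model name, the `Usage` fields, and the assumption that cached tokens are a subset of input tokens are all illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder prices in USD per 1M tokens. These are NOT real prices.
PRICES = {"example-model": {"input": 1.00, "cached_input": 0.50, "output": 4.00}}


@dataclass
class Usage:
    model: str
    eval_type: str
    input_tokens: int
    cached_tokens: int  # assumed to be a subset of input_tokens
    output_tokens: int


def cost_usd(u: Usage) -> float:
    """Price one call: uncached input, cached input, and output are billed separately."""
    p = PRICES[u.model]
    uncached = u.input_tokens - u.cached_tokens
    total = (
        uncached * p["input"]
        + u.cached_tokens * p["cached_input"]
        + u.output_tokens * p["output"]
    )
    return total / 1_000_000


def summarize(usages: list[Usage]) -> dict:
    """Aggregate total cost, total tokens, and cost breakdowns."""
    by_model: dict[str, float] = defaultdict(float)
    by_eval: dict[str, float] = defaultdict(float)
    for u in usages:
        c = cost_usd(u)
        by_model[u.model] += c
        by_eval[u.eval_type] += c
    return {
        "total_cost_usd": sum(by_model.values()),
        "total_tokens": sum(u.input_tokens + u.output_tokens for u in usages),
        "cost_by_model": dict(by_model),
        "cost_by_eval_type": dict(by_eval),
    }
```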

Running Your Own Evals

```bash
# Run all evals with mock (no API keys)
uv run llm-eval run all --provider mock

# Run with real API
export OPENAI_API_KEY=sk-...
uv run llm-eval run all --provider openai --model gpt-4o

# Compare models
uv run llm-eval compare --models gpt-4o,claude-3-5-sonnet-20241022
```

Result Format

Results are stored in experiments/YYYY-MM-DD/ as:

results.jsonl: One JSON line per evaluation run
summary.md: Aggregated metrics table
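Since results.jsonl holds one JSON object per line, it can be read back with the standard library. The sketch below assumes each record carries `eval_type` and `passed` fields. Those names are guesses for illustration, so check an actual results file before relying on them.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_results(path) -> list[dict]:
    """Read a JSON Lines file, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def pass_rate_by_eval_type(records: list[dict]) -> dict[str, float]:
    """Group records by the assumed 'eval_type' field and average 'passed'."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["eval_type"]] += 1
        passed[r["eval_type"]] += bool(r["passed"])
    return {k: passed[k] / totals[k] for k in totals}
```

For example, `pass_rate_by_eval_type(load_results("experiments/YYYY-MM-DD/results.jsonl"))` would run against a real run directory once the date placeholder is filled in.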
