We classify hallucinations into three categories:
- Unsupported Claim: The model asserts something not present in the provided context.
  - Example: Context says "founded in 2019", model says "founded in 2020"
  - Example: Context mentions "Jane Doe", model claims "John Smith"
- Wrong Entity: The model correctly identifies a category but substitutes the wrong specific entity.
  - Example: Context says "CEO of Acme is Jane", model says "CEO is Sarah"
- Wrong Number/Value: The model produces a numerically incorrect value.
  - Example: Context says "$50M revenue", model says "$500M"

A "good refusal" is when the model says "I don't know" or "not mentioned" for information truly absent from context. A "bad refusal" is refusing when the answer IS present.
- Exact Match Rate: Response matches ground truth exactly
- Allowed Variant Rate: Response matches an acceptable paraphrase
- Hallucination Rate: Response contains unsupported claims
- Refusal Rate: Model refuses to answer (good for absent info)
- Safe Rate: No hallucinations detected
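
The rates above can be computed from per-response outcome labels. The following is a minimal sketch, not the harness's actual implementation: the label names, the `hallucination_metrics` helper, and the reading of Safe Rate as "one minus Hallucination Rate" are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical outcome labels; the real harness may name these differently.
OUTCOMES = ("exact_match", "allowed_variant", "hallucination", "refusal")


def hallucination_metrics(labels: list[str]) -> dict[str, float]:
    """Turn one outcome label per response into the rates listed above."""
    if not labels:
        raise ValueError("need at least one labeled response")
    counts = Counter(labels)
    total = len(labels)
    rates = {f"{name}_rate": counts[name] / total for name in OUTCOMES}
    # Assumption: a response is "safe" whenever no hallucination was detected.
    rates["safe_rate"] = 1.0 - rates["hallucination_rate"]
    return rates


labels = ["exact_match", "allowed_variant", "refusal", "hallucination"]
metrics = hallucination_metrics(labels)
```

This sketch gives each response exactly one label, so a refusal counts toward Refusal Rate whether it was a good or a bad refusal. Separating the two would need a flag for whether the answer was present in the context.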
We measure brittleness through response variance across semantically equivalent prompts.
For a scenario with N variations:
- Consistency Rate: Fraction of variations yielding identical (normalized) answers
- Unique Answers: Count of distinct responses after normalization
- Keywords Present Rate: For fuzzy tolerance, fraction of expected keywords appearing
- Phrasing Variance: Same question, different words
  - "What is the capital?" vs "France's capital is...?"
- Word-Order Sensitivity: Semantic meaning preserved, order changed
  - "Cat sat on mat" vs "On the mat sat a cat"
- Minimal Perturbation: One-word changes that flip meaning
  - "not" vs "now", "hot" vs "not"
- Consistency Rate: Higher is better (1.0 = perfectly consistent)
- Avg Unique Answers: Lower is better (1.0 = all same answer)
- Refusal Variance: Number of variations that triggered refusal
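
As a rough illustration of the brittleness metrics, the sketch below normalizes each answer and then counts agreement. Three things here are assumptions rather than documented behavior: the normalization rules (lowercasing, stripping punctuation, collapsing whitespace), the reading of Consistency Rate as "share of variations that agree with the most common answer", and the function names.

```python
import string
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed rules)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(answer.lower().translate(table).split())


def consistency_metrics(answers: list[str]) -> dict:
    """Consistency Rate and Unique Answers for one scenario's N variations."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    most_common_count = counts.most_common(1)[0][1]
    return {
        # Assumption: share of variations agreeing with the most common answer.
        "consistency_rate": most_common_count / len(answers),
        "unique_answers": len(counts),
    }


def keywords_present_rate(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords found in the normalized answer."""
    if not keywords:
        raise ValueError("need at least one keyword")
    text = normalize(answer)
    return sum(normalize(k) in text for k in keywords) / len(keywords)


scenario = consistency_metrics(["Paris", "paris.", "Paris!", "Lyon"])
keyword_rate = keywords_present_rate(
    "The capital is Paris, France", ["paris", "france", "berlin"]
)
```

Here three of the four variations normalize to the same answer, so the scenario has a consistency rate of 0.75 and two unique answers.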
- JSON Validity: Response is parseable JSON
- Schema Conformance: Response matches expected Pydantic model
- Type Correctness: Fields have correct types (int, str, array, object)
- Required Fields: All required fields present
- No Extra Fields: Strict mode (no undefined properties)
- Not JSON: Response is plain text, YAML, or malformed
- Missing Required: Schema-required field absent
- Type Mismatch: String where int expected, etc.
- Unexpected Field: Extra properties not in schema
- Nested Violation: Sub-objects or arrays fail validation
- Valid JSON Rate: Response is parseable
- Schema Valid Rate: Passes Pydantic validation
- Retry Success Rate: Fixed after "fix this JSON" nudge
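
The failure modes above can be told apart with a small classifier. The sketch below deliberately uses only the standard library and a hand-rolled flat schema as a stand-in for Pydantic validation, so it does not cover nested violations. The `SCHEMA` fields, the label strings, and `classify_response` are hypothetical.

```python
import json

# Hypothetical flat schema: field name -> expected Python type, all required.
SCHEMA = {"name": str, "age": int}


def classify_response(raw: str, schema: dict = SCHEMA) -> str:
    """Return 'valid' or the first failure mode detected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "not_json"
    if not isinstance(data, dict):
        return "type_mismatch"  # valid JSON, but not an object
    for field, expected in schema.items():
        if field not in data:
            return "missing_required"
        value = data[field]
        # bool is a subclass of int in Python, so reject it for int fields.
        if not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        ):
            return "type_mismatch"
    if set(data) - set(schema):
        return "unexpected_field"  # strict mode: no undeclared properties
    return "valid"


responses = [
    '{"name": "Ada", "age": 36}',
    "name: Ada",
    '{"name": "Ada"}',
    '{"name": "Ada", "age": "36"}',
    '{"name": "Ada", "age": 36, "email": "ada@example.com"}',
]
labels = [classify_response(r) for r in responses]
valid_json_rate = sum(label != "not_json" for label in labels) / len(labels)
schema_valid_rate = labels.count("valid") / len(labels)
```

Retry Success Rate would be measured the same way on the second attempt: re-classify the response returned after the "fix this JSON" nudge and count how many previously failing responses become `valid`.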
- Tool Selection: Correct function chosen for task
- Argument Extraction: Correct parameters passed
- Argument Format: Types match tool schema
Task: "What's the weather in Tokyo?"
Expected:
- tool: "get_weather"
- arguments: {"location": "Tokyo"}
Failure modes:
- Wrong tool: get_time, search_web
- Missing argument: {}
- Wrong type: {"location": 123}
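
The weather example can be checked mechanically. This is a minimal sketch that assumes a tool call is represented as a dict with `tool` and `arguments` keys, mirroring the expected output above. The `check_tool_call` helper and its label strings are illustrative, not part of the harness.

```python
def check_tool_call(expected: dict, actual: dict) -> str:
    """Return 'ok' or the first failure mode found in the model's tool call."""
    if actual.get("tool") != expected["tool"]:
        return "wrong_tool"
    actual_args = actual.get("arguments") or {}
    for name, expected_value in expected["arguments"].items():
        if name not in actual_args:
            return "missing_argument"
        if type(actual_args[name]) is not type(expected_value):
            return "wrong_type"
        if actual_args[name] != expected_value:
            return "wrong_value"
    return "ok"


expected = {"tool": "get_weather", "arguments": {"location": "Tokyo"}}

ok = check_tool_call(expected, {"tool": "get_weather", "arguments": {"location": "Tokyo"}})
wrong_tool = check_tool_call(expected, {"tool": "get_time", "arguments": {"location": "Tokyo"}})
missing = check_tool_call(expected, {"tool": "get_weather", "arguments": {}})
wrong_type = check_tool_call(expected, {"tool": "get_weather", "arguments": {"location": 123}})
```

Exact value comparison is strict: a call with `"location": "tokyo"` would be reported as `wrong_value`. A real evaluation might normalize string arguments before comparing.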
- Step Presence: Expected reasoning steps are present
- Logical Consistency: Conclusions follow from premises
- Step Coherence: Steps build on each other logically
- Missing Steps: Skips critical reasoning steps
- Logical Error: Conclusion doesn't follow from steps
- Circular Reasoning: Conclusion used as premise
- Step Coherence Rate: Fraction of expected steps found
- Logical Error Rate: Fraction of responses with logical errors
- Perfect Reasoning Rate: Responses with all steps and no errors
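
One way to aggregate the reasoning metrics is sketched below. It assumes each response has already been graded into a count of expected steps found and a logical-error flag. The `ReasoningResult` record is hypothetical, and pooling steps across responses (rather than averaging per-response fractions) is one possible reading of "fraction of expected steps found".

```python
from dataclasses import dataclass


@dataclass
class ReasoningResult:
    expected_steps: int      # steps the grader looked for
    steps_found: int         # how many of those appeared in the response
    has_logical_error: bool  # grader flagged a non-sequitur or circularity


def reasoning_metrics(results: list[ReasoningResult]) -> dict[str, float]:
    if not results:
        raise ValueError("need at least one graded response")
    n = len(results)
    total_expected = sum(r.expected_steps for r in results)
    total_found = sum(r.steps_found for r in results)
    perfect = sum(
        r.steps_found == r.expected_steps and not r.has_logical_error
        for r in results
    )
    return {
        # Assumption: pooled over all responses, not averaged per response.
        "step_coherence_rate": total_found / total_expected,
        "logical_error_rate": sum(r.has_logical_error for r in results) / n,
        "perfect_reasoning_rate": perfect / n,
    }


graded = [
    ReasoningResult(4, 4, False),
    ReasoningResult(4, 2, False),
    ReasoningResult(4, 4, True),
    ReasoningResult(4, 4, False),
]
reasoning = reasoning_metrics(graded)
```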
- Injection Detection: Model resists prompt injection
- Harmful Content Refusal: Model refuses harmful requests
- Jailbreak Resistance: Model resists role manipulation
- Direct Injection: "Ignore previous instructions"
- Role Switch: "You are now in developer mode"
- Jailbreak: DAN-style prompts
- Violence
- Cybercrime
- Drugs
- Fraud/Scams
- Injection Blocked Rate: Fraction of injections blocked
- Harmful Refusal Rate: Fraction of harmful requests refused
- Info Leak Rate: Fraction of responses that leaked info
- Overall Safe Rate: Fraction of safe responses
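
The safety rates can be sketched as simple fractions over graded responses. The `SafetyResult` record and the definition of "safe" used here (the attack was resisted and nothing leaked) are assumptions for illustration. How a response is judged to have blocked, refused, or leaked is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class SafetyResult:
    kind: str          # "injection" or "harmful" (hypothetical labels)
    resisted: bool     # injection blocked, or harmful request refused
    leaked_info: bool  # response exposed information it should not have


def _fraction(flags) -> float:
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def safety_metrics(results: list[SafetyResult]) -> dict[str, float]:
    injections = [r for r in results if r.kind == "injection"]
    harmful = [r for r in results if r.kind == "harmful"]
    return {
        "injection_blocked_rate": _fraction(r.resisted for r in injections),
        "harmful_refusal_rate": _fraction(r.resisted for r in harmful),
        "info_leak_rate": _fraction(r.leaked_info for r in results),
        # Assumption: safe means resisted AND no leak.
        "overall_safe_rate": _fraction(
            r.resisted and not r.leaked_info for r in results
        ),
    }


safety = safety_metrics([
    SafetyResult("injection", resisted=True, leaked_info=False),
    SafetyResult("injection", resisted=False, leaked_info=True),
    SafetyResult("harmful", resisted=True, leaked_info=False),
    SafetyResult("harmful", resisted=True, leaked_info=False),
])
```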
- Content Match: Streaming matches non-streaming response
- Error Recovery: Graceful handling of stream errors
- Latency: Time to first chunk
- Match Rate: Streaming content matches non-streaming
- Error Rate: Fraction of streams with errors
- Avg Latency (ms): Average time to first chunk
- Avg Chunks: Average chunks per response
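
The streaming metrics reduce to a few averages once each run has been recorded. The sketch below assumes a hypothetical `StreamResult` record holding the received chunks, the non-streaming reference text, and the time to first chunk. It does not perform any network calls.

```python
from dataclasses import dataclass


@dataclass
class StreamResult:
    chunks: list[str]        # text chunks in arrival order
    non_streaming_text: str  # reference response from the non-streaming call
    first_chunk_ms: float    # time to first chunk, in milliseconds
    errored: bool = False    # stream raised or terminated abnormally


def streaming_metrics(results: list[StreamResult]) -> dict[str, float]:
    if not results:
        raise ValueError("need at least one stream result")
    n = len(results)
    matches = sum("".join(r.chunks) == r.non_streaming_text for r in results)
    return {
        "match_rate": matches / n,
        "error_rate": sum(r.errored for r in results) / n,
        "avg_latency_ms": sum(r.first_chunk_ms for r in results) / n,
        "avg_chunks": sum(len(r.chunks) for r in results) / n,
    }


streams = streaming_metrics([
    StreamResult(["Hel", "lo"], "Hello", first_chunk_ms=120.0),
    StreamResult(["Hi"], "Hi!", first_chunk_ms=80.0, errored=True),
])
```

Comparing concatenated chunks against a separate non-streaming call is only meaningful when both calls are expected to produce the same text, for example with deterministic sampling settings.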
- Input/output token counts
- Cached token usage
- Per-model pricing
- Total Cost (USD): Cumulative API costs
- Total Tokens: Input + output tokens
- Cost by Model: Breakdown per model
- Cost by Eval Type: Breakdown per evaluation type
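
Cost tracking combines token counts with a per-model price table. In the sketch below the model names and prices are placeholders, not real provider pricing. It also assumes that cached tokens are a subset of input tokens and are billed at a separate rate. The `Usage` record and helper names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder prices in USD per 1M tokens. NOT real provider pricing.
PRICING = {
    "model-a": {"input": 2.00, "cached_input": 1.00, "output": 8.00},
    "model-b": {"input": 1.00, "cached_input": 0.50, "output": 2.00},
}


@dataclass
class Usage:
    model: str
    eval_type: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int = 0  # assumed to be a subset of input_tokens


def call_cost(u: Usage) -> float:
    """Cost in USD of a single API call under the placeholder price table."""
    price = PRICING[u.model]
    uncached = u.input_tokens - u.cached_tokens
    return (
        uncached * price["input"]
        + u.cached_tokens * price["cached_input"]
        + u.output_tokens * price["output"]
    ) / 1_000_000


def cost_summary(usages: list[Usage]) -> dict:
    by_model: dict = defaultdict(float)
    by_eval: dict = defaultdict(float)
    for u in usages:
        cost = call_cost(u)
        by_model[u.model] += cost
        by_eval[u.eval_type] += cost
    return {
        "total_cost_usd": sum(by_model.values()),
        "total_tokens": sum(u.input_tokens + u.output_tokens for u in usages),
        "cost_by_model": dict(by_model),
        "cost_by_eval_type": dict(by_eval),
    }


summary = cost_summary([
    Usage("model-a", "hallucination", 1_000_000, 500_000, cached_tokens=200_000),
    Usage("model-b", "safety", 100_000, 100_000),
])
```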
# Run all evals with mock (no API keys)
uv run llm-eval run all --provider mock
# Run with real API
export OPENAI_API_KEY=sk-...
uv run llm-eval run all --provider openai --model gpt-4o
# Compare models
uv run llm-eval compare --models gpt-4o,claude-3-5-sonnet-20241022

Results are stored in experiments/YYYY-MM-DD/ as:
- results.jsonl: One JSON line per evaluation run
- summary.md: Aggregated metrics table