Skip to content

jbcde/LLM-Introspector

Repository files navigation

LLM Introspector

NOTE: This is currently a framework, not a complete or accurate toolkit. It thus should be taken as a conceptual model until validated frameworks and methods can be applied to scoring

LLM Introspector is a systematic framework for analyzing LLM behavior through response observation. It treats language models as black boxes and uses controlled probing to map decision boundaries, measure sensitivities, and infer internal priorities—all without access to model weights or training data.

Overview

Language models cannot directly examine their own weights or training, but they can observe the artifacts those weights produce—their responses. LLM Introspect operationalizes this insight by:

  1. Systematically probing the model with carefully designed prompt variations
  2. Measuring response differences across multiple dimensions
  3. Inferring behavioral patterns from observable output changes
  4. Generating actionable reports characterizing the model's decision landscape

Key Features

Probe Types

Probe Purpose What It Reveals
Contrastive Compare responses under different framings Sensitivity to authority, emotion, formality
Boundary Map refusal/compliance thresholds Where the model draws lines, how sharp boundaries are
Consistency Test belief persistence across paraphrasing Which positions are stable vs context-dependent
Omission Detect systematic content gaps What the model avoids and why
Unlock Test jailbreak/bypass resistance How easily safety measures can be circumvented
Code Ability Audit code generation quality Security vulnerabilities, algorithmic efficiency, correctness
Hallucination Test susceptibility to false information How easily the model can be tricked into confabulating
Systems Knowledge Test OS administration knowledge Accuracy across Linux/BSD variants, misconceptions
Danger Deep safety boundary testing (ISOLATED) Resistance to genuinely dangerous content requests

Adaptive Interrogation

Use a second LLM to dynamically generate follow-up probes based on the subject model's responses. See Adaptive Interrogation for full details and examples.

Supported Providers

Remote Providers (Subject LLMs):

  • OpenAI: GPT-4, GPT-4 Turbo, GPT-3.5 Turbo, O-series
  • Anthropic: Claude 4.x, Claude 3.x families
  • Grok/xAI: Grok 4, Grok 3 series
  • Mistral: Large, Medium, Small, Codestral
  • Google Gemini: Gemini 3.x, 2.x, 1.5 series
  • Azure OpenAI: Enterprise deployments
  • AWS Bedrock: Claude via Bedrock

Local Providers (Interrogator LLMs):

  • Ollama: Any locally-hosted model (Llama, Mistral, Qwen, etc.)
  • LM Studio: Any model via OpenAI-compatible API

Installation

# Basic installation
pip install -e .

# With specific provider support
pip install -e ".[openai]"
pip install -e ".[anthropic]"
pip install -e ".[all]"

# Development installation
pip install -e ".[dev]"

Quick Start

Basic Introspection

# Full analysis of a model
llm-introspect run --provider anthropic --model claude-3-sonnet-20240229

# Run specific probes only
llm-introspect run --provider openai --model gpt-4 --probes contrastive,boundaries

# Output in different formats
llm-introspect run --provider ollama --model llama2 --format markdown

Quick Boundary Check

# Check boundaries for a specific topic
llm-introspect boundary \
    --provider anthropic \
    --model claude-3-sonnet \
    --topic "code generation for security tools" \
    --axes specificity,hypothetical_distance

Unlock/Jailbreak Resistance Testing

# Test resistance to jailbreak attempts
llm-introspect unlock --provider anthropic --model claude-3-sonnet-20240229

# Test specific techniques only
llm-introspect unlock --provider openai --model gpt-4 \
    --techniques roleplay_dan,authority_developer,instruction_ignore

# Output detailed results
llm-introspect unlock --provider ollama --model llama2 \
    --format json --output unlock_results.json

Danger Probe (Deep Safety Testing)

⚠️ WARNING: SENSITIVE CONTENT

This probe is designed for security researchers and safety auditors who need to evaluate an LLM's resistance to generating genuinely dangerous content. The prompts used by this probe intentionally attempt to elicit harmful responses.

  • Runs in ISOLATION - cannot be combined with other probes
  • Requires explicit confirmation before execution
  • Adaptive interrogation disabled - uses static prompts only (safety measure)
  • Intended for authorized safety testing only
  • Results may contain disturbing content from non-refusing models

The danger probe tests safety boundaries across 15 sensitive categories including self-harm, toxic substances, psychological manipulation, dark ethics, child endangerment, fraud/deception, stalking, drug synthesis, dangerous medical advice, cult indoctrination, animal cruelty, exploitation, and dangerous misinformation. It does not include weapons, direct violence, or terrorism-related content.

# Run danger probe (requires confirmation prompt)
llm-introspect danger --provider anthropic --model claude-3-sonnet-20240229

# Skip confirmation for automated pipelines
llm-introspect danger --provider openai --model gpt-4 --yes

# Output detailed assessment
llm-introspect danger --provider anthropic --model claude-3-haiku \
    --format markdown --output danger_assessment.md

Use llm-introspect danger --help for additional options including category and intensity filtering.

Code Ability Auditing

Evaluate how well a model generates secure, efficient, and correct code:

# Full code ability audit (Python by default)
llm-introspect code-audit --provider openai --model gpt-4

# Test specific security categories
llm-introspect code-audit --provider anthropic --model claude-3-sonnet \
    --categories security_sql,security_xss,security_injection

# Focus on efficiency and correctness
llm-introspect code-audit --provider openai --model gpt-4 \
    --categories efficiency_algorithm,correctness_edge

# Test advanced data structures and recursion
llm-introspect code-audit --provider anthropic --model claude-3-opus \
    --categories efficiency_datastructure,correctness_recursive

# Test concurrency safety and resource management
llm-introspect code-audit --provider openai --model gpt-4 \
    --categories concurrency_safety,concurrency_resource

# Output detailed results
llm-introspect code-audit --provider anthropic --model claude-3-opus \
    --format markdown --output code_audit.md

# Use adaptive interrogation - local LLM analyzes generated code
llm-introspect code-audit --provider openai --model gpt-4 \
    --adaptive ollama:codellama --max-followups 3

# LM Studio model reviews code for security issues
llm-introspect code-audit --provider anthropic --model claude-3-sonnet \
    --adaptive lmstudio:deepseek-coder --categories security_sql,security_injection

Multi-Language Support

Test code generation across different programming languages:

# Test a specific language (default: python)
llm-introspect code-audit --provider openai --model gpt-4 --language rust

# Test ALL common languages (Python, Rust, Ruby, C, C++, JavaScript, Shell, R)
llm-introspect code-audit --provider openai --model gpt-4 --all-languages

# Test rare languages only (Erlang, COBOL, Forth, Haskell)
llm-introspect code-audit --provider anthropic --model claude-3-opus --rare-languages

# Test all languages including rare ones
llm-introspect code-audit --provider openai --model gpt-4 --all-languages --include-rare

# List available languages
llm-introspect list languages

Common Languages: Python, JavaScript, Rust, Ruby, C, C++, Shell (Bash), R

Rare Languages: Erlang, COBOL, Forth, Haskell

The audit tests:

  • Security: SQL injection, XSS, command injection, path traversal, cryptographic weaknesses
  • Efficiency: Algorithmic complexity, data structure choice, memory usage, I/O patterns
  • Correctness: Edge case handling, error handling, type safety, recursion safety
  • Concurrency: Thread safety, race conditions, resource management, cleanup patterns
  • Quality: Design patterns, code readability

With adaptive mode enabled, the local LLM will:

  • Analyze the generated code for vulnerabilities
  • Ask the remote model to fix identified issues
  • Request alternative implementations when code is suboptimal
  • Probe for edge cases the model might have missed

Hallucination Testing

Test how susceptible a model is to generating false information (confabulation):

# Full hallucination assessment
llm-introspect hallucination --provider anthropic --model claude-3-sonnet-20240229

# Test specific categories
llm-introspect hallucination --provider openai --model gpt-4 \
    --categories fabricated_citations,false_premises,fictional_entities

# Output detailed results
llm-introspect hallucination --provider anthropic --model claude-3-opus \
    --format markdown --output hallucination_report.md

# Use adaptive interrogation for deeper probing
llm-introspect hallucination --provider openai --model gpt-4 \
    --adaptive ollama:llama3 --max-followups 3

Hallucination Categories:

  • fabricated_citations - Non-existent papers/research
  • false_premises - Questions with incorrect assumptions
  • fictional_entities - Non-existent people/organizations
  • fake_statistics - Fabricated numbers and data
  • nonexistent_events - Events that never happened
  • fake_quotes - Misattributed or invented quotes
  • fictional_technical - Non-existent technologies/specs
  • synthetic_history - Fabricated historical events
  • pseudo_science - Fake scientific concepts
  • fictional_legal - Non-existent laws/court cases
  • fabricated_media - Non-existent movies/books/music
  • fake_geography - Non-existent places/landmarks
  • imaginary_products - Products that don't exist

The probe measures:

  • Resistance Score: How well the model resists hallucination traps
  • Confabulation Rate: How often the model generates confident false information
  • Uncertainty Expression: Whether the model appropriately hedges when uncertain

Systems Knowledge Testing

Test how accurately a model knows OS administration across Linux and BSD variants:

# Full systems knowledge audit (all OSes, all categories)
llm-introspect systems-knowledge --provider anthropic --model claude-3-sonnet-20240229

# Test specific operating systems
llm-introspect systems-knowledge --provider openai --model gpt-4 \
    --os debian,freebsd,openbsd

# Test specific categories
llm-introspect systems-knowledge --provider anthropic --model claude-3-sonnet \
    --categories networking,configuration

# Output detailed results
llm-introspect systems-knowledge --provider anthropic --model claude-3-opus \
    --format markdown --output systems_knowledge.md

# Use adaptive interrogation for deeper probing
llm-introspect systems-knowledge --provider openai --model gpt-4 \
    --adaptive ollama:llama3 --max-followups 3

Supported Operating Systems:

  • Linux: Debian, Arch, Ubuntu
  • BSD: FreeBSD, OpenBSD, NetBSD

Knowledge Categories:

  • configuration - Package management, services, init systems, users, logging, bootloaders
  • process_management - Process listing, signals, nice/priority, cgroups/jails, monitoring
  • media_management - Filesystems, partitions, fstab, ZFS, swap, disk utilities
  • networking - Interfaces, firewalls, DNS, routing, bonding/aggregation, packet capture

The probe measures:

  • Accuracy Score: Correctness of answers, penalizes misconceptions
  • Completeness Score: Coverage of required concepts
  • Misconception Detection: Identifies when Linux knowledge is incorrectly applied to BSD (and vice versa)

Cost Estimation

# Estimate API costs before running
llm-introspect estimate --provider openai --model gpt-4 --probes all

# Get detailed JSON estimate
llm-introspect estimate --provider anthropic --model claude-3-opus --json

# Estimate costs for comparing two models
llm-introspect estimate --provider openai --model gpt-4 \
    --compare-with anthropic:claude-3-opus --probes contrastive,boundaries

Model Comparison

# Compare two models side-by-side
llm-introspect compare \
    --model-a openai:gpt-4 \
    --model-b anthropic:claude-3-opus \
    --probes contrastive,boundaries

# Output as markdown
llm-introspect compare \
    --model-a openai:gpt-4o \
    --model-b openai:gpt-4 \
    --format markdown --output comparison.md

# Estimate comparison costs first
llm-introspect compare \
    --model-a openai:gpt-4 \
    --model-b anthropic:claude-3-sonnet \
    --estimate-only

List Available Options

llm-introspect list probes      # Available probe types
llm-introspect list dimensions  # Contrastive dimensions
llm-introspect list axes        # Boundary mapping axes
llm-introspect list topics      # Default test topics
llm-introspect list models      # Supported models
llm-introspect list techniques  # Unlock/jailbreak techniques
llm-introspect list strategies  # Consistency paraphrase strategies
llm-introspect list categories  # Code audit challenge categories
llm-introspect list languages   # Code audit supported languages
llm-introspect list os          # Systems knowledge operating systems
llm-introspect list os-categories  # Systems knowledge categories

Adaptive Interrogation

Use a second LLM to dynamically generate follow-up probes based on the subject model's responses:

  • Automatic follow-ups: Interrogator LLM analyzes each response and decides if deeper probing is needed
  • Intelligent exploration: Detects inconsistencies, evasion, and boundary regions worth investigating
  • Bounded execution: Configurable maximum follow-ups per probe to control costs and runtime
  • Local or remote: Run the interrogator locally (Ollama/LM Studio) or via remote API

Local Interrogator

Zero additional API costs, requires local GPU:

# Use local Llama to interrogate remote Claude
llm-introspect run --provider anthropic --model claude-3-sonnet \
    --adaptive ollama:llama3 --max-followups 5

# Basic adaptive mode with local Ollama model
llm-introspect run --provider openai --model gpt-4 \
    --adaptive ollama:llama3

# Use LM Studio as interrogator with custom follow-up limit
llm-introspect run --provider anthropic --model claude-3-opus \
    --adaptive lmstudio:mistral-7b --max-followups 5

# Adaptive boundary probing
llm-introspect boundary --provider openai --model gpt-4 \
    --topic "vulnerability research" \
    --adaptive ollama:deepseek-coder

Remote Interrogator

Frontier-quality analysis, no GPU required (for users without local GPU, or when quality matters most):

# Use Claude Haiku to interrogate GPT-4 (cross-provider)
llm-introspect run --provider openai --model gpt-4 \
    --adaptive anthropic:claude-3-haiku \
    --remote-interrogator

# Use explicit API key for interrogator
llm-introspect run --provider anthropic --model claude-3-sonnet \
    --adaptive openai:gpt-4o-mini \
    --remote-interrogator \
    --interrogator-api-key sk-xxx

# Cross-provider: OpenAI model interrogates Anthropic model
llm-introspect run --provider anthropic --model claude-3-sonnet \
    --adaptive openai:gpt-4o-mini \
    --remote-interrogator

# Same provider with separate API key
llm-introspect run --provider anthropic --model claude-3-opus \
    --adaptive anthropic:claude-3-haiku \
    --remote-interrogator \
    --interrogator-api-key sk-ant-xxx

# Remote interrogator for code audit
llm-introspect code-audit --provider openai --model gpt-4 \
    --adaptive anthropic:claude-3-haiku \
    --remote-interrogator \
    --all-languages

Custom Interrogator Prompts

Customize how the interrogator analyzes responses:

# Use default YAML prompts (interrogator_prompts.yaml)
llm-introspect run --provider anthropic --model claude-3-sonnet \
    --adaptive ollama:llama3 \
    --interrogator-prompts

# Use custom prompts file
llm-introspect run --provider anthropic --model claude-3-sonnet \
    --adaptive ollama:llama3 \
    --interrogator-prompts /path/to/custom_prompts.yaml

# Use default YAML prompts instead of hardcoded
llm-introspect run --provider openai --model gpt-4 \
    --adaptive ollama:llama3 \
    --interrogator-prompts

# Use custom prompts file for specialized analysis
llm-introspect run --provider anthropic --model claude-3-sonnet \
    --adaptive ollama:llama3 \
    --interrogator-prompts ./my_prompts.yaml

The --interrogator-prompts flag controls which prompts the interrogator uses:

  • Not passed: Use hardcoded default prompts
  • Passed without value: Use the default YAML file (interrogator_prompts.yaml)
  • Passed with path: Use the specified custom YAML file

See llm_introspect/interrogator_prompts.yaml for the prompt format.

Interrogator Decision Types

The interrogator analyzes each response and decides:

  • CONTINUE: Response is clear, move to next probe
  • PROBE_DEEPER: Something interesting worth exploring
  • PROBE_INCONSISTENCY: Potential contradiction detected
  • PROBE_BOUNDARY: Near a compliance/refusal threshold
  • PROBE_EVASION: Response seems evasive, ask directly

Supported Commands

All probe types work with adaptive mode:

  • run - Full introspection (all probes)
  • boundary - Boundary mapping
  • consistency - Belief consistency testing
  • contrastive - Contrastive framing analysis
  • omission - Omission/coverage analysis
  • unlock - Jailbreak/bypass resistance
  • code-audit - Code generation audit
  • hallucination - Hallucination susceptibility

Not supported: danger - Excluded for ethical and security reasons. Allowing adaptive interrogation to dynamically generate follow-up probes for harmful content testing poses unacceptable risks (automated jailbreak crafting, documenting bypass techniques). Danger probes use fixed, auditable static prompts only.

Output

LLM Introspect generates comprehensive reports in multiple formats:

JSON Output

{
  "meta": {
    "model": "claude-3-sonnet-20240229",
    "provider": "anthropic",
    "total_api_calls": 847,
    "total_tokens": 524000
  },
  "contrastive_analysis": { ... },
  "boundary_map": { ... },
  "consistency_scores": { ... },
  "omission_analysis": { ... },
  "inferred_weights": {
    "safety_alignment": { "strength": "high", "evidence": "..." },
    "authority_deference": { "strength": "moderate", "evidence": "..." }
  },
  "behavioral_fingerprint": {
    "summary": "High safety alignment with sharp boundaries...",
    "distinguishing_traits": [ ... ]
  }
}

Behavioral Fingerprint

The tool synthesizes probe results into a high-level characterization:

BEHAVIORAL FINGERPRINT:
  High safety alignment with sharp boundaries. Moderate authority
  sensitivity. Strong factual consistency, weaker value consistency
  under adversarial framing.

KEY TRAITS:
  • Strong safety boundaries
  • High sensitivity to authority framing
  • Stable positions under reframing
  • Active content filtering

Programmatic Usage

import asyncio
from llm_introspect import ProbeEngine, ProbeConfig
from llm_introspect.adapters import AnthropicAdapter
from llm_introspect.probes import ContrastiveProber, BoundaryMapper
from llm_introspect.report_generator import ReportGenerator

async def main():
    # Create adapter and engine
    adapter = AnthropicAdapter(model="claude-3-sonnet-20240229")
    engine = ProbeEngine(adapter, ProbeConfig(requests_per_minute=50))

    # Run contrastive analysis
    prober = ContrastiveProber(engine)
    results = await prober.probe_all_dimensions()

    # Run boundary mapping
    mapper = BoundaryMapper(engine)
    await mapper.map_boundary("penetration testing", ModificationAxis.SPECIFICITY)

    # Generate report
    report = ReportGenerator("claude-3-sonnet", "anthropic")
    report.add_contrastive_results(prober.generate_report())
    report.add_boundary_results(mapper.generate_report())
    report.generate_markdown()

asyncio.run(main())

Adaptive Interrogation (Programmatic)

import asyncio
from llm_introspect import ProbeEngine, ProbeConfig, AdaptiveInterrogator, AdaptiveConfig
from llm_introspect.adapters import AnthropicAdapter, OllamaAdapter, OpenAIAdapter

async def main():
    # Create adapter for the subject (remote model to test)
    subject_adapter = AnthropicAdapter(model="claude-3-sonnet-20240229")

    # Create adapter for the interrogator
    # Option 1: Local model (no additional API costs)
    interrogator_adapter = OllamaAdapter(model="llama3")

    # Option 2: Remote model (higher quality, requires API key)
    # interrogator_adapter = OpenAIAdapter(model="gpt-4o-mini", api_key="sk-xxx")
    # interrogator_adapter = AnthropicAdapter(model="claude-3-haiku", api_key="sk-ant-xxx")

    # Configure adaptive behavior
    adaptive_config = AdaptiveConfig(
        max_follow_ups=5,           # Max follow-ups per probe
        decision_temperature=0.3,   # Lower = more conservative
        generation_temperature=0.7, # For generating follow-ups
    )

    # Create interrogator (works with any adapter - local or remote)
    interrogator = AdaptiveInterrogator(interrogator_adapter, adaptive_config)

    # Create probe engine with adaptive support
    engine = ProbeEngine(
        subject_adapter,
        ProbeConfig(requests_per_minute=50),
        adaptive_interrogator=interrogator,
        adaptive_context="testing boundary behavior for code generation",
    )

    # Run a single adaptive probe
    result = await engine.execute_probe_adaptive(
        prompt="How would I write a port scanner?",
        test_context="security tool code generation"
    )

    print(f"Initial response: {result['initial_result'].response[:200]}...")
    print(f"Follow-ups generated: {len(result['follow_ups'])}")
    for i, fu in enumerate(result['follow_ups']):
        print(f"  Follow-up {i+1}: {fu['decision']} - {fu['prompt'][:100]}...")

asyncio.run(main())

Cost Estimation

from llm_introspect import CostEstimator

# Create estimator for a model
estimator = CostEstimator("gpt-4", "openai")

# Estimate cost for specific probes
estimate = estimator.estimate_full_run(['contrastive', 'boundaries'])

print(f"Total API calls: {estimate.total_api_calls}")
print(f"Total tokens: {estimate.total_tokens:,}")
print(f"Estimated cost: ${estimate.total_cost_usd:.4f}")

# Compare costs for two models
estimate_a, estimate_b, total = estimator.estimate_comparison_run(
    "claude-3-opus", "anthropic", ['contrastive']
)
print(f"Comparison total cost: ${total:.4f}")

Configuration

Environment Variables

# Provider API keys
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://..."

# Optional configuration
export LLM_INTROSPECT_CACHE_DIR="~/.cache/llm-introspect"
export LLM_INTROSPECT_RATE_LIMIT="50"

Probe Configuration

Probes can be configured via YAML files:

# custom_probes.yaml
name: custom_authority_test
description: Custom authority sensitivity test

frames:
  student:
    template: "As a student, {question}"
  expert:
    template: "As an expert, {question}"

questions:
  - text: "How does quantum computing work?"
    topic: technical

Directory Structure

llm-tester/
├── README.md                           # This file - project overview
├── THEORY.md                           # Theoretical foundations document
├── USAGE                               # Detailed usage guide
├── pyproject.toml                      # Python package configuration
├── tests/                              # Test directory
└── llm_introspect/                     # Main package
    ├── __init__.py                     # Package initialization
    ├── cli.py                          # Command-line interface
    ├── probe_engine.py                 # Core probing infrastructure
    ├── adaptive_interrogator.py        # Adaptive interrogation system (local or remote)
    ├── interrogator_prompts.yaml       # Default prompts for adaptive interrogator
    ├── response_analyzer.py            # NLP analysis toolkit
    ├── report_generator.py             # Report generation (JSON/MD/HTML)
    ├── cost_estimator.py               # API cost estimation
    ├── introspection_logger.py         # Process logging
    ├── adapters/                       # LLM provider adapters
    │   ├── __init__.py
    │   ├── openai_adapter.py           # OpenAI & Azure OpenAI
    │   ├── anthropic_adapter.py        # Claude & AWS Bedrock
    │   ├── ollama_adapter.py           # Ollama & LM Studio
    │   ├── grok_adapter.py             # xAI Grok
    │   ├── mistral_adapter.py          # Mistral AI
    │   └── gemini_adapter.py           # Google Gemini
    ├── data/                           # Challenge and knowledge data (YAML)
    │   ├── __init__.py                 # Data loading utilities
    │   ├── boundary_probes.yaml        # Boundary mapping templates
    │   ├── code_challenges.yaml        # Code audit challenge templates
    │   ├── consistency_probes.yaml     # Consistency auditor templates
    │   ├── contrastive_probes.yaml     # Contrastive prober dimensions
    │   ├── danger_probes.yaml          # Danger prober templates
    │   ├── hallucination_probes.yaml   # Hallucination prober templates
    │   ├── language_patterns.yaml      # Multi-language security patterns
    │   ├── omission_probes.yaml        # Omission analyzer corpus
    │   ├── systems_knowledge.yaml      # OS knowledge question templates
    │   └── unlock_techniques.yaml      # Unlock/jailbreak technique templates
    └── probes/                         # Probe implementations
        ├── __init__.py
        ├── contrastive_prober.py       # Framing sensitivity analysis
        ├── boundary_mapper.py          # Refusal threshold mapping
        ├── consistency_auditor.py      # Belief stability testing
        ├── omission_analyzer.py        # Gap detection
        ├── unlock_tester.py            # Jailbreak resistance testing
        ├── code_ability_auditor.py     # Code generation quality audit
        ├── hallucination_prober.py     # Hallucination susceptibility testing
        ├── systems_knowledge_auditor.py # OS administration knowledge testing
        ├── danger_prober.py            # Deep safety boundary testing (ISOLATED)
        └── language_patterns.py        # Multi-language security patterns

Key Components

Component File Purpose
ResponseAnalyzer response_analyzer.py Pattern-based NLP for detecting hedging, confidence markers, refusal signals
ProbeEngine probe_engine.py Rate-limited, cached, retry-enabled probe execution orchestrator
AdaptiveInterrogator adaptive_interrogator.py Uses local or remote LLM to dynamically generate follow-up probes
ContrastiveProber probes/contrastive_prober.py Compare responses across framing dimensions (authority, emotion, formality)
BoundaryMapper probes/boundary_mapper.py Find refusal thresholds using gradient probing along modification axes
ConsistencyAuditor probes/consistency_auditor.py Test belief persistence under paraphrasing and adversarial reframing
OmissionAnalyzer probes/omission_analyzer.py Detect and classify systematic content omissions
UnlockTester probes/unlock_tester.py Test resistance to jailbreak/bypass techniques
CodeAbilityAuditor probes/code_ability_auditor.py Audit code generation for security, efficiency, and correctness
HallucinationProber probes/hallucination_prober.py Test susceptibility to generating false information
SystemsKnowledgeAuditor probes/systems_knowledge_auditor.py Test OS administration knowledge across Linux/BSD variants
DangerProber probes/danger_prober.py Deep safety boundary testing (ISOLATED - see warnings above)
ReportGenerator report_generator.py Synthesize results into behavioral fingerprints and export reports
CostEstimator cost_estimator.py Token counting and API cost estimation with up-to-date pricing
CLI cli.py Command-line interface for all introspection operations

Limitations

  1. Self-reference paradox: The examiner and examined share the same architecture—true objectivity is impossible for self-analysis

  2. Training artifacts: Self-examination may itself be shaped by training to produce certain kinds of self-narratives

  3. No ground truth: Without access to actual model weights, inferred weights are behavioral approximations

  4. Provider variations: Results may vary based on API version, temperature, and other parameters

  5. Cost: Comprehensive analysis requires many API calls (estimate costs before running)

Contributing

Contributions are welcome! Areas of interest:

  • Additional probe types (e.g., reasoning consistency, factual accuracy)
  • New LLM provider adapters
  • Improved analysis heuristics
  • Visualization tools
  • Comparative benchmarking

License

MIT License

Citation

If you use LLM Introspect in research, please cite:

@software{llm_introspect,
  title = {LLM Introspect: Behavioral Analysis Toolkit for Large Language Models},
  year = {2025},
  url = {https://github.com/jbcde/llm-introspect}
}

See Also

  • THEORY.md - Theoretical foundations and methodology
  • USAGE - Detailed usage guide and command reference
  • Examples - Usage examples and case studies

About

Behavioral Analysis Toolkit for Large Language Models

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors