Add Direct Logit Attribution tool (#1263) by azrabano23 · Pull Request #1369 · TransformerLensOrg/TransformerLens

azrabano23 · 2026-06-07T04:47:32Z

Summary

Closes #1263.

Adds a single-call Direct Logit Attribution (DLA) tool at
transformer_lens/tools/analysis/direct_logit_attribution.py, built for the
TransformerBridge system (and working unchanged on HookedTransformer, since
both share the ActivationCache API).

DLA decomposes a model's output logit — or a logit difference between a
correct and an incorrect token — into the additive contributions of upstream
components. The tool wraps the existing ActivationCache primitives
(decompose_resid / accumulated_resid / stack_head_results / logit_attrs)
into one ergonomic entry point.

API

from transformer_lens.tools.analysis import direct_logit_attribution

result = direct_logit_attribution(
    model,                         # HookedTransformer or TransformerBridge
    "The Eiffel Tower is in the city of",
    answer_tokens=" Paris",
    incorrect_tokens=" London",    # optional: attribute the logit difference
    unit="component",              # "component" | "layer" | "head"
)
result.top(5)                      # [(label, value), ...]

unit="component" — embedding + each layer's attn/MLP output (decompose_resid)
unit="layer" — cumulative residual stream per sublayer, i.e. logit-lens (accumulated_resid)
unit="head" — each attention head + a remainder term (stack_head_results)
pos selects the position to attribute (default -1; None keeps all positions)
cache lets you reuse a precomputed ActivationCache instead of re-running the model
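
The difference between the additive "component" view and the cumulative "layer" view can be sketched with a purely linear toy. The numbers and labels below are made up for illustration, and the sketch ignores layer normalization, which the real tool has to handle and which means this exact relationship need not hold on a real model.

```python
from itertools import accumulate

# Made-up direct contributions of each component to one logit,
# listed in the order they write to the residual stream.
labels = ["embed", "0_attn_out", "0_mlp_out", "1_attn_out", "1_mlp_out"]
component = [0.5, -2.0, 1.25, 0.25, 3.0]

# "layer"-style view: running total of everything written so far (logit lens).
cumulative = list(accumulate(component))

# In this linear toy, consecutive differences of the cumulative view
# recover the per-component view.
recovered = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for label, c, acc in zip(labels, component, cumulative):
    print(f"{label:>12}  component={c:+.2f}  cumulative={acc:+.2f}")
```

The per-component values sum to the final logit contribution, while the cumulative values should not be summed: only the last entry corresponds to the full residual stream.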

Returns a DirectLogitAttribution dataclass (attribution tensor aligned with
labels, plus a top(k) helper).
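
A result container of this kind might look roughly like the following. This is a minimal sketch, not the PR's implementation: the class name `AttributionSketch`, its field names, the use of plain lists instead of tensors, and the choice to rank by absolute value are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AttributionSketch:
    """Hypothetical stand-in for the PR's DirectLogitAttribution result."""

    labels: list[str]         # one label per attributed unit
    attribution: list[float]  # contribution of each unit, aligned with `labels`

    def top(self, k: int) -> list[tuple[str, float]]:
        # Assumption: rank by absolute contribution, largest first.
        pairs = list(zip(self.labels, self.attribution))
        return sorted(pairs, key=lambda p: abs(p[1]), reverse=True)[:k]


# Made-up values, only to show the shape of the return value.
result = AttributionSketch(
    labels=["embed", "0_attn_out", "0_mlp_out", "1_attn_out"],
    attribution=[0.5, -2.0, 1.25, 0.1],
)
print(result.top(2))
```

Keeping labels and values in one object means a caller never has to track the ordering of the underlying stack separately.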

Correctness

The integration tests assert the exact DLA invariant on both
HookedTransformer and TransformerBridge (compatibility mode): a complete
decomposition reconstructs the model's real logit. DLA attributes only the
W_U-direction part of a logit, so the invariant is

sum(DLA for token) + b_U[token] == logit[token]

and for a token difference the bias terms do not generally cancel (in gpt2,
folding ln_final's bias into the unembedding makes b_U token-dependent), so
the test compares against logit_diff - (b_U[correct] - b_U[incorrect]).
Because the tests compare against the model's real logits, an attribution that
merely has the right shape and labels fails them.
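
The invariant can be checked by hand on a toy linear model. The sketch below uses made-up numbers, a 3-dimensional residual stream and a 2-token vocabulary, and it omits the final layer norm entirely. It illustrates the arithmetic only and is not the PR's test code.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Made-up component outputs written into a 3-dimensional residual stream.
components = {
    "embed": [1.0, 0.0, 2.0],
    "0_attn_out": [0.5, -1.0, 0.0],
    "0_mlp_out": [-0.5, 2.0, 1.0],
}
# The final residual stream is the sum of all component outputs.
resid = [sum(vals) for vals in zip(*components.values())]

# Toy unembedding: one direction and one bias per vocabulary token.
W_U = {"correct": [1.0, 2.0, 0.5], "incorrect": [0.0, 1.0, 1.0]}
b_U = {"correct": 0.25, "incorrect": -0.5}


def logit(tok):
    return dot(resid, W_U[tok]) + b_U[tok]


def dla(tok):
    # Direct contribution of each component along the token's W_U direction.
    return {name: dot(vec, W_U[tok]) for name, vec in components.items()}


# Single-token invariant: sum(DLA) + b_U[token] == logit[token].
single_ok = sum(dla("correct").values()) + b_U["correct"] == logit("correct")

# Logit-difference invariant: the biases differ, so they must be subtracted.
dla_diff = sum(dla("correct").values()) - sum(dla("incorrect").values())
logit_diff = logit("correct") - logit("incorrect")
bias_diff = b_U["correct"] - b_U["incorrect"]
diff_ok = dla_diff == logit_diff - bias_diff

print(single_ok, diff_ok, bias_diff)
```

Here the bias difference is nonzero, so comparing the summed attribution directly against the raw logit difference would fail even though the decomposition itself is complete.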

Testing

tests/integration/model_bridge/test_direct_logit_attribution.py — 13 tests
(component/layer/head reconstruction on HT + Bridge, labels/shape, cache reuse,
pos=None, top(), and argument validation). Placed in integration/ per
tests/AGENTS.md since it loads gpt2.
make check-format and uv run mypy on the new module both pass.

Add transformer_lens/tools/analysis/direct_logit_attribution.py, a single-call DLA analysis that decomposes a logit (or logit difference) into per-component, per-layer (logit-lens), or per-head contributions.

Wraps the existing ActivationCache primitives (decompose_resid / accumulated_resid / stack_head_results / logit_attrs) and works with both HookedTransformer and TransformerBridge, since they share the cache API. Returns a DirectLogitAttribution dataclass (attribution tensor + aligned labels, plus a top(k) helper).

Adds integration tests asserting the exact DLA correctness invariant on both systems: the complete decomposition reconstructs the model's real logit up to the unembedding bias b_U.

Closes TransformerLensOrg#1263

azrabano23 wants to merge 1 commit into TransformerLensOrg:dev from azrabano23:dla-tool-1263
