Add Direct Logit Attribution tool (#1263) #1369

Open
azrabano23 wants to merge 1 commit into TransformerLensOrg:dev from azrabano23:dla-tool-1263

Conversation

@azrabano23

Summary

Closes #1263.

Adds a single-call Direct Logit Attribution (DLA) tool at
transformer_lens/tools/analysis/direct_logit_attribution.py, built for the
TransformerBridge system (and working unchanged on HookedTransformer, since
both share the ActivationCache API).

DLA decomposes a model's output logit — or a logit difference between a
correct and an incorrect token — into the additive contributions of upstream
components. The tool wraps the existing ActivationCache primitives
(decompose_resid / accumulated_resid / stack_head_results / logit_attrs)
into one ergonomic entry point.
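
As a rough illustration of the additive decomposition described above, here is a minimal pure-Python sketch. It does not use TransformerLens; the component labels, vectors, and unembedding columns are made-up toy values, and layer norm is ignored (an assumption made only to keep the arithmetic linear and easy to follow).

```python
# Toy residual stream: the final residual is the sum of the component outputs.
# All values are invented for illustration; layer norm is deliberately ignored.
components = {
    "embed": [0.5, -1.0, 2.0],
    "0_attn_out": [1.0, 0.0, -0.5],
    "0_mlp_out": [-0.25, 0.75, 1.0],
}
w_u_correct = [1.0, 2.0, 0.5]     # toy unembedding column for the correct token
w_u_incorrect = [0.5, -1.0, 1.0]  # toy unembedding column for the incorrect token


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Direction in residual space that reads off (correct logit - incorrect logit).
direction = [c - i for c, i in zip(w_u_correct, w_u_incorrect)]

# Each component's contribution is its projection onto that direction.
attribution = {name: dot(vec, direction) for name, vec in components.items()}

# Because the projection is linear, the contributions sum to the projection
# of the full residual stream.
final_resid = [sum(vals) for vals in zip(*components.values())]
total = dot(final_resid, direction)
```

In this toy setup the per-component values add up to the projection of the final residual, which is the property that makes the attribution "direct".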

API

from transformer_lens.tools.analysis import direct_logit_attribution

result = direct_logit_attribution(
    model,                         # HookedTransformer or TransformerBridge
    "The Eiffel Tower is in the city of",
    answer_tokens=" Paris",
    incorrect_tokens=" London",    # optional: attribute the logit difference
    unit="component",              # "component" | "layer" | "head"
)
result.top(5)                      # [(label, value), ...]
  • unit="component" — embedding + each layer's attn/MLP output (decompose_resid)
  • unit="layer" — cumulative residual stream per sublayer, i.e. logit-lens (accumulated_resid)
  • unit="head" — each attention head + a remainder term (stack_head_results)
  • pos selects the position to attribute (default -1; None keeps all positions)
  • cache lets you reuse a precomputed ActivationCache instead of re-running the model
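
One way to picture how the "component" and "layer" units relate, assuming the same linear readout is applied to every entry: the cumulative view is a running sum of the per-component view. The sketch below uses invented values and plain Python; it is not the tool's implementation.

```python
from itertools import accumulate

# Hypothetical per-component attributions, in residual-stream order.
component_attr = [("embed", -3.75), ("0_attn_out", 0.75), ("0_mlp_out", 1.625)]

# Running total: the attribution of the residual stream after each component
# has been added, i.e. a logit-lens style view of the same numbers.
running = list(accumulate(value for _, value in component_attr))
cumulative_attr = [(label, total) for (label, _), total in zip(component_attr, running)]
```

The last cumulative entry equals the sum of all component entries, so both views agree on the final value.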

Returns a DirectLogitAttribution dataclass (attribution tensor aligned with
labels, plus a top(k) helper).
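
A minimal sketch of what such a result container could look like. The class name `AttributionSketch`, the field names, the use of a plain list instead of a tensor, and the choice to rank by descending value are all assumptions for illustration; the actual DirectLogitAttribution dataclass may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AttributionSketch:
    """Hypothetical stand-in: one attribution value per label, same order."""

    attribution: List[float]
    labels: List[str]

    def top(self, k: int) -> List[Tuple[str, float]]:
        # Assumption: rank by signed value, largest first. Ranking by
        # absolute value would be an equally plausible design.
        pairs = sorted(
            zip(self.labels, self.attribution),
            key=lambda pair: pair[1],
            reverse=True,
        )
        return pairs[:k]


sketch = AttributionSketch(
    attribution=[-3.75, 0.75, 1.625],
    labels=["embed", "0_attn_out", "0_mlp_out"],
)
best_two = sketch.top(2)
```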

Correctness

The integration tests assert the exact DLA invariant on both
HookedTransformer and TransformerBridge (compatibility mode): a complete
decomposition reconstructs the model's real logit. DLA attributes only the
W_U-direction part of a logit, so the invariant is

sum(DLA for token) + b_U[token] == logit[token]

and for a logit difference the bias terms do not generally cancel (gpt2's
folded ln_final bias makes b_U[correct] and b_U[incorrect] differ), so the
test compares against logit_diff - (b_U[correct] - b_U[incorrect]). The tests
are written to fail if the attribution is only superficially correct.
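
The bias bookkeeping can be checked on a toy example. The sketch below uses invented numbers and plain Python, with no layer norm and no real model; it only illustrates why the logit-difference comparison has to subtract the bias difference.

```python
def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# Toy final residual, unembedding columns, and unembedding biases (all made up).
resid = [1.25, -0.25, 2.5]
w_u = {"correct": [1.0, 2.0, 0.5], "incorrect": [0.5, -1.0, 1.0]}
b_u = {"correct": 0.5, "incorrect": -0.25}

# Real logits include the bias; the attribution total covers only the W_U part.
logit = {tok: dot(resid, w_u[tok]) + b_u[tok] for tok in w_u}
dla_total = {tok: dot(resid, w_u[tok]) for tok in w_u}

# Single-token invariant: attribution total plus bias equals the logit.
single_ok = all(dla_total[tok] + b_u[tok] == logit[tok] for tok in w_u)

# Logit-difference invariant: the biases differ, so they must be subtracted.
logit_diff = logit["correct"] - logit["incorrect"]
bias_diff = b_u["correct"] - b_u["incorrect"]
dla_diff = dla_total["correct"] - dla_total["incorrect"]
```

Here `dla_diff` matches `logit_diff - bias_diff` but not `logit_diff` itself, which is the case the test comparison is designed to handle.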

Testing

  • tests/integration/model_bridge/test_direct_logit_attribution.py — 13 tests
    (component/layer/head reconstruction on HT + Bridge, labels/shape, cache reuse,
    pos=None, top(), and argument validation). Placed in integration/ per
    tests/AGENTS.md since it loads gpt2.
  • make check-format and uv run mypy on the new module both pass.

Add transformer_lens/tools/analysis/direct_logit_attribution.py, a single-call
DLA analysis that decomposes a logit (or logit difference) into per-component,
per-layer (logit-lens), or per-head contributions. Wraps the existing
ActivationCache primitives (decompose_resid / accumulated_resid /
stack_head_results / logit_attrs) and works with both HookedTransformer and
TransformerBridge, since they share the cache API.

Returns a DirectLogitAttribution dataclass (attribution tensor + aligned
labels, plus a top(k) helper). Adds integration tests asserting the exact DLA
correctness invariant on both systems: the complete decomposition reconstructs
the model's real logit up to the unembedding bias b_U.

Closes TransformerLensOrg#1263