Replies: 2 comments
The warning comes from the recently integrated features for ensuring weight tying. For a model with tied embeddings like yours, not setting `ensure_weight_tying` explicitly triggers this warning, since PEFT cannot know whether you want the tying preserved.

Regarding your question: untying has the negative effect of doubling the memory needed for the embedding matrix. In models like Gemma, which feature a very large embedding matrix, this can already be the difference between running out of memory during training or not. As long as you do parameter-efficient fine-tuning (LoRA, MiSS, trainable tokens, ...), you can emulate the behavior of untying the embeddings by setting `ensure_weight_tying=False`: the base weights stay tied, but each side gets its own adapter, so the effective input and output embeddings can diverge.
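A minimal sketch of that setup, assuming `ensure_weight_tying` is accepted as a `LoraConfig` argument in recent PEFT versions (as the option names in this thread suggest); the checkpoint and module names are illustrative and vary by architecture:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; any causal LM with tie_word_embeddings=True works.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

config = LoraConfig(
    r=16,
    # Adapters on the usual attention projections plus both ends of the
    # tied embedding (module names depend on the architecture).
    target_modules=["q_proj", "v_proj", "embed_tokens", "lm_head"],
    # Keep the base weights tied, but let the two embedding adapters train
    # independently, emulating untied embeddings at a fraction of the memory.
    ensure_weight_tying=False,
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()
```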
Option B: Untie embeddings (set `tie_word_embeddings=False`).
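A minimal sketch of what that untying amounts to for a standard transformers causal LM (the checkpoint name is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint with tied embeddings.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

# Give lm_head its own copy of the embedding weights and record the untying
# in the config so the weights are not re-tied later. Note this doubles the
# memory held by the embedding matrix, as the first reply warns.
tied_weight = model.get_input_embeddings().weight
model.get_output_embeddings().weight = torch.nn.Parameter(tied_weight.detach().clone())
model.config.tie_word_embeddings = False
```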
Context
When fine-tuning with LoRA and adding special tokens (e.g. a pad token via `tokenizer.add_special_tokens` + `model.resize_token_embeddings`), I encounter a warning about the tied word embeddings.
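For reference, a minimal sketch of the setup (the checkpoint name is a placeholder; any model with `tie_word_embeddings=True` applies):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-2b"  # placeholder; any checkpoint with tied embeddings
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Add a dedicated pad token, then grow the (tied) embedding matrix to match
# the new vocabulary size; this is the step that precedes the warning.
tokenizer.add_special_tokens({"pad_token": "<pad>"})
model.resize_token_embeddings(len(tokenizer))
```

This raises a design question I'd like clarification on.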
Question
When adding special tokens (particularly a pad token) and applying LoRA, which approach is preferred?

Option A: Keep `tie_word_embeddings=True` and set `ensure_weight_tying=True` to maintain the pretrained model's architecture (a config sketch follows below).

Option B: Untie embeddings before fine-tuning (set `tie_word_embeddings=False`) so that `embed_tokens` and `lm_head` are trained independently.

If anyone has experience with this or can point me to relevant discussions, I'd really appreciate the guidance!
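For concreteness, a hypothetical sketch of the Option A config, under the assumption that `ensure_weight_tying` is accepted by `LoraConfig`:

```python
from peft import LoraConfig

# Option A, as I understand it: keep the pretrained tying and ask PEFT to
# actively preserve it, so embed_tokens and lm_head (and any adapters placed
# on them) stay in sync throughout training.
config_a = LoraConfig(
    r=16,
    target_modules=["q_proj", "v_proj", "embed_tokens", "lm_head"],
    ensure_weight_tying=True,
)
```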