Honor the model's generation_config.eos_token_id in the transformers backend#1279

Open
waqaskhan137 wants to merge 1 commit into
huggingface:mainfrom
waqaskhan137:fix-eos-token-id-chat-models

@waqaskhan137

Fixes #1278.

What

For generative tasks, _generate_padded's generation_config.update(...) sets eos_token_id=self.tokenizer.eos_token_id, overriding the terminators the model itself declares in generation_config.eos_token_id. Chat models whose turn terminator is not the tokenizer's eos never stop: they emit their turn-end token, it is ignored, and every generation runs to max_new_tokens.

Concrete case in the issue: Gemma ends chat turns with token 106 while tokenizer.eos_token is <eos> (id 1); the model declares generation_config.eos_token_id = [1, 106, 50]. With the override, an MMLU chain-of-thought task padded every generation to the cap, with 63-95% of returned tokens being token 106 repeated.
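The failure mode can be reproduced without loading a model. The following is a toy sketch, not the backend's code: `generate` and `toy_model` are hypothetical stand-ins, and the assumption that the model keeps emitting token 106 after its turn ends is taken from the padding behaviour described above.

```python
def generate(stream, eos_ids, max_new_tokens):
    """Toy decoding loop: stop on the first token in eos_ids or at the cap."""
    out = []
    for tok in stream:
        if len(out) >= max_new_tokens:
            break
        out.append(tok)
        if tok in eos_ids:
            break
    return out


def toy_model():
    # Hypothetical chat model: three content tokens, then the turn-end
    # token 106 repeated indefinitely (assumption for illustration).
    yield from (11, 12, 13)
    while True:
        yield 106


# With the override, only the tokenizer's eos (id 1) stops generation.
overridden = generate(toy_model(), eos_ids={1}, max_new_tokens=10)

# With the model's declared terminators, token 106 ends the turn.
declared = generate(toy_model(), eos_ids={1, 106, 50}, max_new_tokens=10)

print(len(overridden), len(declared))
```

In the overridden case the loop runs to the cap and the tail of the output is token 106 repeated; with the declared terminators it stops at the first 106.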

Fix

Prefer self.model.generation_config.eos_token_id when set, falling back to self.tokenizer.eos_token_id (one line, plus a comment). Models whose declared eos already equals the tokenizer's are unaffected.
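A minimal sketch of the fallback logic, assuming "when set" means "is not None". The helper name `resolve_eos_token_id` is hypothetical and this is not the literal diff; the `SimpleNamespace` objects stand in for a real model and tokenizer so the snippet runs on its own.

```python
from types import SimpleNamespace


def resolve_eos_token_id(model, tokenizer):
    """Prefer the model's declared terminators; fall back to the tokenizer's eos."""
    declared = getattr(model.generation_config, "eos_token_id", None)
    # The declared value may be a single id or a list of ids.
    return declared if declared is not None else tokenizer.eos_token_id


# Stand-ins for illustration only (values taken from the Gemma case above).
tokenizer = SimpleNamespace(eos_token_id=1)
gemma_like = SimpleNamespace(
    generation_config=SimpleNamespace(eos_token_id=[1, 106, 50])
)
no_declared_eos = SimpleNamespace(
    generation_config=SimpleNamespace(eos_token_id=None)
)

chat_eos = resolve_eos_token_id(gemma_like, tokenizer)
fallback_eos = resolve_eos_token_id(no_declared_eos, tokenizer)
print(chat_eos, fallback_eos)
```

The first call returns the model's full terminator list, so token 106 ends the turn; the second falls back to the tokenizer's eos, which matches the previous behaviour.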

Validation

Measured on google/gemma-4-E2B-it, MMLU CoT item, RTX PRO 6000, bf16 (details in #1278):

| | before | after |
|---|---|---|
| tokens_generated | 7168 (= cap) | 2654 (stops on its own) |
| token-106 padding | 6774 | 1 |
| latency / item | 145 s | 55 s |
| extracted answer | correct | correct (unchanged) |

ruff check and ruff format --check pass on the touched file.

…backend

For generative tasks the transformers backend overrode eos_token_id with
tokenizer.eos_token_id, discarding the terminators the model itself declares.
Chat models whose turn terminator is not the tokenizer eos (e.g. Gemma ends
turns with token 106 while tokenizer.eos is 1, generation_config declares
[1, 106, 50]) therefore never stopped and padded every generation to
max_new_tokens, wasting up to ~95% of generated tokens and corrupting
generation-length measurements.

Prefer model.generation_config.eos_token_id when set, falling back to the
tokenizer's. Measured on Gemma MMLU CoT: 7168 -> 2654 tokens, 145s -> 55s per
item, extracted answer unchanged.

Fixes huggingface#1278