Beyond English-Only GRPO: a multi-seed controlled study of training-language and auxiliary-reward effects in sub-3B math reasoning (GRPO + LoRA, single GPU).
multilingual reinforcement-learning lora reproducibility cross-lingual llm qwen mgsm grpo math-reasoning
Updated Jun 30, 2026 - Python