Odds Ratio Preference Optimization (ORPO)
Odds Ratio Preference Optimization (ORPO) is a reference-model-free alignment method that unifies supervised fine-tuning and preference learning in a single step. It uses an odds-ratio loss to contrast favored vs. disfavored outputs, without needing a separate reward or reference model. In effect, ORPO adds a small penalty on disfavored outputs during fine-tuning, simplifying the alignment pipeline.
- Use case: Preference data with one chosen and one rejected response per prompt, where you want to train the model to follow instructions and align to preferences at the same time. ORPO lets you do instruction fine-tuning (SFT) and preference tuning in one pass.
- How it works: ORPO applies an odds-ratio penalty: for each prompt, it increases the model’s probability of the chosen response and decreases that of the rejected one. It is “monolithic” because it merges the SFT and preference losses, requiring no separate RL step.
- Key CLI args: Use --train-mode orpo. Set --beta (default 0.1), a temperature for the logistic component that controls the strength of the preference penalty.
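The odds-ratio penalty described above can be sketched in a few lines. This is a minimal illustration of the loss for a single preference pair, not the mlx-lm-lora implementation; the function names and the per-pair scalar formulation are assumptions for clarity.

```python
import math

def log1mexp(x: float) -> float:
    # Numerically stable log(1 - exp(x)) for x < 0.
    if x > -math.log(2):
        return math.log(-math.expm1(x))
    return math.log1p(-math.exp(x))

def orpo_loss(logp_chosen: float, logp_rejected: float, beta: float = 0.1) -> float:
    """Sketch of the ORPO objective for one (chosen, rejected) pair.

    logp_*: average token log-probability of each response (negative).
    beta: weight of the odds-ratio penalty (the --beta CLI flag).
    """
    # log-odds of each response: log(p / (1 - p)) = log p - log(1 - p)
    log_odds_chosen = logp_chosen - log1mexp(logp_chosen)
    log_odds_rejected = logp_rejected - log1mexp(logp_rejected)
    # log of the odds ratio between chosen and rejected
    ratio = log_odds_chosen - log_odds_rejected
    # odds-ratio penalty: -log sigmoid(ratio), small when chosen >> rejected
    or_term = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    # SFT term (NLL on the chosen response) plus the weighted penalty
    return -logp_chosen + beta * or_term
```

Note how the SFT and preference terms are summed in one loss: making the rejected response less likely shrinks the penalty, while the chosen response is still trained with plain negative log-likelihood.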
```shell
mlx_lm_lora.train \
  --model facebook/opt-125m \
  --train \
  --train-mode orpo \
  --data data/chat_preferences.jsonl \
  --beta 0.1
```

This runs ORPO on chat preference data. The dataset format is JSONL with "chosen" and "rejected" fields (as in DPO). ORPO does not use a separate reward model; it contrasts the two outputs directly.
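For reference, one line of such a dataset might look like the following. The "chosen" and "rejected" fields match the format described above; the "prompt" field and all values are illustrative assumptions, so check your dataset loader for the exact schema it expects.

```json
{"prompt": "Explain what a binary search does.", "chosen": "Binary search repeatedly halves a sorted range to locate a target in O(log n) comparisons.", "rejected": "It checks every element one by one."}
```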
Cited: Hong et al. (2024) introduce ORPO as a “monolithic” (reference-model-free) preference optimization method. Their abstract notes that ORPO “eliminates the necessity for an additional preference alignment phase” and is effective across model sizes.
⚙️ MLX-LM-LORA is proudly built on top of MLX-LM and optimized exclusively for Apple Silicon.
Trained with state-of-the-art fine-tuning algorithms like LoRA, DoRA, DPO, ORPO, GRPO, and CPO.
Made with ❤️ by Gökdeniz Gülmez · Powered by MLX